1 Executive Summary

In our final group assignment, we will analyze data about Airbnb listings and fit a model to predict the total cost for two people staying 4 nights in an Airbnb in a city. We have chosen Amsterdam for our analysis.

Our prediction is that the average price of a 4-night stays for 2 people in a room rate 4.5 or above in Amsterdam is between 376\(-387\).

The following variables that can be used to determine the prices impact the price positively:

  • The rating of reviews
  • The number of bedrooms and the capacity of the rental home (accommodates)
  • The room type with hotel rooms and entire apartments
  • Super host attracts more clients and can charge a premium
  • Highly available rooms positively impact the price. There might be a bigger market for cheap rooms which are less available.
  • Most popular neighborhood positively impacts the price.

The following variables that can be used to determine the prices impact the price negatively:

  • Some Room Types: Private and Shared rooms. These rooms might be smaller, and clients of these rooms (students) might be willing to pay less for a rental unit. The room type with hotel rooms and entire apartments being the most expensive units.
  • The number of reviews. It is likely because clients see a place with a lot of reviews less exclusive. This might also serve as a proxy for total visits (and cheaper places will likely have less visist)

The following variables were not used because they didn’t seem relevant for our models: - The property types and number of beds capture the same information as room type and the number of bedrooms. - The presence of a license is not relevant. - Last reviews, compared to the average rating, don’t give additional information.

Next potential steps: Using these variables, we were able to explain up to 50% of the price differences. Other variables such as the photos, number of visits, the host presentation can help explain more accurately the price. Areas for further exploration include exploring different cities, seasons in terms of price.

2 Introduction

In our final group assignment we are going to analyse data about Airbnb listings and fit a model to predict the total cost for two people staying 4 nights in an AirBnB in a city. We have chosen Amsterdam for our analysis.

Our AirBnB data come from insideairbnb.com; it was originally scraped from airbnb.com.

Below we load the data from Airbnb:

3 Exploratory Data Analysis (EDA)

Let’s first get an overview of our data.

Afterward we will wrangle the data so that we can work with it in our models and plot some of the key relationships.

3.1 Data overview

We use glimpse and skim to gain a general view of the data. We have 74 variables and 16,116 observations. Variables include 24 character variables, 5 date variables, 8 logical variables, and 37 numeric variables. We also have some missing values throughout multiple of the variables in our data.

#Overview of Data
glimpse(listings)
Rows: 16,116
Columns: 74
$ id                                           <dbl> 2818, 20168, 25428, 27886…
$ listing_url                                  <chr> "https://www.airbnb.com/r…
$ scrape_id                                    <dbl> 2.021091e+13, 2.021091e+1…
$ last_scraped                                 <date> 2021-09-07, 2021-09-07, …
$ name                                         <chr> "Quiet Garden View Room &…
$ description                                  <chr> "Quiet Garden View Room &…
$ neighborhood_overview                        <chr> "Indische Buurt (\"Indies…
$ picture_url                                  <chr> "https://a0.muscache.com/…
$ host_id                                      <dbl> 3159, 59484, 56142, 97647…
$ host_url                                     <chr> "https://www.airbnb.com/u…
$ host_name                                    <chr> "Daniel", "Alexander", "J…
$ host_since                                   <date> 2008-09-24, 2009-12-02, …
$ host_location                                <chr> "Amsterdam, Noord-Holland…
$ host_about                                   <chr> "Upon arriving in Amsterd…
$ host_response_time                           <chr> "within an hour", "within…
$ host_response_rate                           <chr> "100%", "100%", "N/A", "8…
$ host_acceptance_rate                         <chr> "100%", "100%", "0%", "10…
$ host_is_superhost                            <lgl> TRUE, FALSE, TRUE, TRUE, …
$ host_thumbnail_url                           <chr> "https://a0.muscache.com/…
$ host_picture_url                             <chr> "https://a0.muscache.com/…
$ host_neighbourhood                           <chr> "Indische Buurt", "Gracht…
$ host_listings_count                          <dbl> 1, 2, 2, 1, 2, 2, 2, 1, 1…
$ host_total_listings_count                    <dbl> 1, 2, 2, 1, 2, 2, 2, 1, 1…
$ host_verifications                           <chr> "['email', 'phone', 'revi…
$ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ host_identity_verified                       <lgl> TRUE, TRUE, FALSE, TRUE, …
$ neighbourhood                                <chr> "Amsterdam, North Holland…
$ neighbourhood_cleansed                       <chr> "Oostelijk Havengebied - …
$ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA, N…
$ latitude                                     <dbl> 52.36435, 52.36407, 52.37…
$ longitude                                    <dbl> 4.94358, 4.89393, 4.88487…
$ property_type                                <chr> "Private room in rental u…
$ room_type                                    <chr> "Private room", "Private …
$ accommodates                                 <dbl> 2, 2, 3, 2, 2, 1, 2, 3, 2…
$ bathrooms                                    <lgl> NA, NA, NA, NA, NA, NA, N…
$ bathrooms_text                               <chr> "1.5 shared baths", "1 pr…
$ bedrooms                                     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ beds                                         <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 2…
$ amenities                                    <chr> "[\"Single level home\", …
$ price                                        <chr> "$59.00", "$106.00", "$12…
$ minimum_nights                               <dbl> 3, 1, 14, 2, 2, 2, 90, 4,…
$ maximum_nights                               <dbl> 28, 365, 120, 730, 1825, …
$ minimum_minimum_nights                       <dbl> 3, 1, 7, 2, 2, 2, 90, 4, …
$ maximum_minimum_nights                       <dbl> 3, 1, 14, 2, 2, 2, 90, 4,…
$ minimum_maximum_nights                       <dbl> 1125, 1125, 120, 1125, 18…
$ maximum_maximum_nights                       <dbl> 1125, 1125, 120, 1125, 18…
$ minimum_nights_avg_ntm                       <dbl> 3.0, 1.0, 13.8, 2.0, 2.0,…
$ maximum_nights_avg_ntm                       <dbl> 1125, 1125, 120, 1125, 18…
$ calendar_updated                             <lgl> NA, NA, NA, NA, NA, NA, N…
$ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ availability_30                              <dbl> 3, 0, 1, 9, 11, 8, 4, 0, …
$ availability_60                              <dbl> 28, 0, 1, 20, 27, 23, 34,…
$ availability_90                              <dbl> 55, 0, 3, 47, 50, 50, 64,…
$ availability_365                             <dbl> 124, 0, 57, 66, 298, 313,…
$ calendar_last_scraped                        <date> 2021-09-07, 2021-09-07, …
$ number_of_reviews                            <dbl> 280, 339, 5, 223, 353, 49…
$ number_of_reviews_ltm                        <dbl> 2, 0, 0, 4, 19, 17, 0, 0,…
$ number_of_reviews_l30d                       <dbl> 0, 0, 0, 2, 8, 10, 0, 0, …
$ first_review                                 <date> 2013-08-25, 2014-01-17, …
$ last_review                                  <date> 2019-11-21, 2020-03-27, …
$ review_scores_rating                         <dbl> 4.89, 4.44, 5.00, 4.95, 4…
$ review_scores_accuracy                       <dbl> 4.93, 4.69, 5.00, 4.93, 4…
$ review_scores_cleanliness                    <dbl> 5.00, 4.79, 5.00, 4.96, 4…
$ review_scores_checkin                        <dbl> 4.97, 4.63, 5.00, 4.95, 4…
$ review_scores_communication                  <dbl> 4.97, 4.62, 5.00, 4.92, 4…
$ review_scores_location                       <dbl> 4.68, 4.87, 5.00, 4.90, 4…
$ review_scores_value                          <dbl> 4.81, 4.49, 4.80, 4.80, 4…
$ license                                      <chr> "0363 5F3A 5684 6750 D14D…
$ instant_bookable                             <lgl> TRUE, TRUE, FALSE, TRUE, …
$ calculated_host_listings_count               <dbl> 1, 2, 1, 1, 2, 2, 1, 1, 2…
$ calculated_host_listings_count_entire_homes  <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 2…
$ calculated_host_listings_count_private_rooms <dbl> 1, 2, 0, 1, 2, 2, 0, 0, 0…
$ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ reviews_per_month                            <dbl> 2.86, 3.64, 0.11, 2.14, 4…
skim(listings)
Data summary
Name listings
Number of rows 16116
Number of columns 74
_______________________
Column type frequency:
character 24
Date 5
logical 8
numeric 37
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 33 37 0 16116 0
name 30 1.00 1 240 0 15766 0
description 223 0.99 1 1000 0 15711 0
neighborhood_overview 5711 0.65 1 1000 0 9495 0
picture_url 0 1.00 61 126 0 15968 0
host_url 0 1.00 38 43 0 14197 0
host_name 5 1.00 1 33 0 5166 0
host_location 35 1.00 2 86 0 560 0
host_about 7066 0.56 1 8920 0 7522 16
host_response_time 5 1.00 3 18 0 5 0
host_response_rate 5 1.00 2 4 0 57 0
host_acceptance_rate 5 1.00 2 4 0 91 0
host_thumbnail_url 5 1.00 55 106 0 14161 0
host_picture_url 5 1.00 57 109 0 14161 0
host_neighbourhood 5713 0.65 4 35 0 68 0
host_verifications 0 1.00 2 158 0 361 0
neighbourhood 5711 0.65 11 59 0 67 0
neighbourhood_cleansed 0 1.00 4 38 0 22 0
property_type 0 1.00 3 35 0 67 0
room_type 0 1.00 10 15 0 4 0
bathrooms_text 21 1.00 6 17 0 26 0
amenities 0 1.00 2 1488 0 14667 0
price 0 1.00 5 9 0 498 0
license 11561 0.28 6 63 0 3624 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2021-09-07 2021-09-08 2021-09-07 2
host_since 5 1.00 2008-09-24 2021-08-31 2015-04-07 3235
calendar_last_scraped 0 1.00 2021-09-07 2021-09-08 2021-09-07 2
first_review 2087 0.87 2011-04-19 2021-09-07 2017-10-30 2467
last_review 2087 0.87 2011-06-12 2021-09-07 2019-06-10 2019

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 5 1 0.12 FAL: 14103, TRU: 2008
host_has_profile_pic 5 1 1.00 TRU: 16078, FAL: 33
host_identity_verified 5 1 0.67 TRU: 10720, FAL: 5391
neighbourhood_group_cleansed 16116 0 NaN :
bathrooms 16116 0 NaN :
calendar_updated 16116 0 NaN :
has_availability 0 1 0.96 TRU: 15457, FAL: 659
instant_bookable 0 1 0.23 FAL: 12408, TRU: 3708

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.118184e+07 13520629.26 2.818000e+03 1.018020e+07 1.926593e+07 3.107548e+07 5.208280e+07 ▇▇▆▅▂
scrape_id 0 1.00 2.021091e+13 0.00 2.021091e+13 2.021091e+13 2.021091e+13 2.021091e+13 2.021091e+13 ▁▁▇▁▁
host_id 0 1.00 6.976052e+07 90271635.64 3.159000e+03 9.735558e+06 2.974134e+07 8.988321e+07 4.210037e+08 ▇▂▁▁▁
host_listings_count 5 1.00 2.020000e+00 23.34 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.992000e+03 ▇▁▁▁▁
host_total_listings_count 5 1.00 2.020000e+00 23.34 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.992000e+03 ▇▁▁▁▁
latitude 0 1.00 5.237000e+01 0.02 5.229000e+01 5.236000e+01 5.236000e+01 5.238000e+01 5.243000e+01 ▁▁▇▅▁
longitude 0 1.00 4.890000e+00 0.04 4.760000e+00 4.860000e+00 4.890000e+00 4.910000e+00 5.070000e+00 ▁▆▇▁▁
accommodates 0 1.00 2.840000e+00 1.31 0.000000e+00 2.000000e+00 2.000000e+00 4.000000e+00 1.600000e+01 ▇▃▁▁▁
bedrooms 898 0.94 1.530000e+00 0.95 1.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 5.000000e+01 ▇▁▁▁▁
beds 97 0.99 1.760000e+00 1.47 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 3.300000e+01 ▇▁▁▁▁
minimum_nights 0 1.00 3.990000e+00 20.99 1.000000e+00 2.000000e+00 2.000000e+00 3.000000e+00 1.100000e+03 ▇▁▁▁▁
maximum_nights 0 1.00 6.081500e+02 540.67 1.000000e+00 2.100000e+01 1.125000e+03 1.125000e+03 1.825000e+03 ▇▁▁▇▁
minimum_minimum_nights 3 1.00 3.950000e+00 20.99 1.000000e+00 2.000000e+00 2.000000e+00 3.000000e+00 1.100000e+03 ▇▁▁▁▁
maximum_minimum_nights 3 1.00 4.150000e+00 21.03 1.000000e+00 2.000000e+00 3.000000e+00 3.000000e+00 1.100000e+03 ▇▁▁▁▁
minimum_maximum_nights 3 1.00 6.855200e+02 532.48 1.000000e+00 2.800000e+01 1.125000e+03 1.125000e+03 1.825000e+03 ▆▁▁▇▁
maximum_maximum_nights 3 1.00 2.672465e+05 23924509.45 1.000000e+00 2.800000e+01 1.125000e+03 1.125000e+03 2.147484e+09 ▇▁▁▁▁
minimum_nights_avg_ntm 3 1.00 4.030000e+00 21.00 1.000000e+00 2.000000e+00 2.000000e+00 3.000000e+00 1.100000e+03 ▇▁▁▁▁
maximum_nights_avg_ntm 3 1.00 2.666400e+05 23870381.65 1.000000e+00 2.800000e+01 1.125000e+03 1.125000e+03 2.142625e+09 ▇▁▁▁▁
availability_30 0 1.00 4.030000e+00 8.70 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 3.000000e+01 ▇▁▁▁▁
availability_60 0 1.00 8.840000e+00 18.14 0.000000e+00 0.000000e+00 0.000000e+00 2.000000e+00 6.000000e+01 ▇▁▁▁▁
availability_90 0 1.00 1.432000e+01 28.51 0.000000e+00 0.000000e+00 0.000000e+00 5.000000e+00 9.000000e+01 ▇▁▁▁▁
availability_365 0 1.00 5.532000e+01 107.91 0.000000e+00 0.000000e+00 0.000000e+00 4.700000e+01 3.650000e+02 ▇▁▁▁▁
number_of_reviews 0 1.00 2.465000e+01 56.71 0.000000e+00 2.000000e+00 8.000000e+00 2.200000e+01 8.770000e+02 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 1.390000e+00 7.48 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 4.220000e+02 ▇▁▁▁▁
number_of_reviews_l30d 0 1.00 3.300000e-01 2.03 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.370000e+02 ▇▁▁▁▁
review_scores_rating 2087 0.87 4.690000e+00 0.67 0.000000e+00 4.670000e+00 4.860000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_accuracy 2301 0.86 4.810000e+00 0.35 0.000000e+00 4.750000e+00 4.910000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_cleanliness 2300 0.86 4.700000e+00 0.44 0.000000e+00 4.600000e+00 4.830000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_checkin 2309 0.86 4.850000e+00 0.31 0.000000e+00 4.820000e+00 4.950000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_communication 2304 0.86 4.870000e+00 0.30 0.000000e+00 4.850000e+00 4.980000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_location 2309 0.86 4.730000e+00 0.33 0.000000e+00 4.600000e+00 4.810000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_value 2309 0.86 4.600000e+00 0.39 0.000000e+00 4.500000e+00 4.670000e+00 4.830000e+00 5.000000e+00 ▁▁▁▁▇
calculated_host_listings_count 0 1.00 1.640000e+00 2.40 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 3.000000e+01 ▇▁▁▁▁
calculated_host_listings_count_entire_homes 0 1.00 1.100000e+00 1.87 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 3.000000e+01 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1.00 4.900000e-01 1.53 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 2.100000e+01 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1.00 1.000000e-02 0.13 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 4.000000e+00 ▇▁▁▁▁
reviews_per_month 2087 0.87 6.800000e-01 1.72 1.000000e-02 1.100000e-01 2.700000e-01 6.200000e-01 1.078400e+02 ▇▁▁▁▁

First out of interest we would like to plot a map of the raw dataset.

By plotting a cluster map that shows the number of flat listed for rent, we want to visualise the concentration of listings in the area of Amsterdam

#Map the concentration of listing in Amsterdam
listings %>% 
  leaflet() %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1,
                   fillOpacity = 0.3, 
                   popup = ~listing_url,
                   label = ~property_type,
                   clusterOptions = markerClusterOptions()
                   )

From this map, we concluded that more properties are available in the central area of Amsterdam. We would like to further investigate the relationship between the price of listings and the locations of properties.

We select some variables for our analysis which we believe would be relevant to determine the impact on prices.

We will hypothesize that the following are key variables which will will be using later on

  • price: cost per night (variable we will be use to predict)

  • property_type: type of accommodation (House, Apartment, etc.)

  • room_type:

    • Entire home/apt (guests have entire place to themselves)
    • Private room (Guests have private room to sleep, all other rooms shared)
    • Shared room (Guests sleep in room shared with others)
  • number_of_reviews: Total number of reviews for the listing

  • review_scores_rating: Average review score (0 - 100)

  • neighbourhood*: three variables on a few major neighbourhoods in each city

  • accomodations: How many people can stay in the room per night

  • beds: Number of beds

  • host_is_superhost: Whether the host has superhost status

  • minimum_nights: Minimum nights to stay in the Airbnb

  • bedrooms: Number of bedrooms

  • reviews_per_month: Number of reviews per month

  • availability_30: Available stays next 30 days

  • license: license number

  • instant_bookable: Whether the room can be instantly booked

If you are interest in further information is is available here

Let’s first filter for our variables of interest

#Select the variables for further exploration

listings2 <- listings %>% 
  
  select(c("host_since", 
           "host_is_superhost", 
           "neighbourhood_cleansed", 
           "property_type", 
           "room_type", 
           "accommodates",
           "price",
           "review_scores_rating", 
           "number_of_reviews",
           "minimum_nights", 
           "reviews_per_month",
           "bedrooms","beds",
           "availability_30",
           "last_review",
           "license",
           "instant_bookable"))

# We want to choose the "bathrooms" as well, however,this variable has no values

3.2 Data wrangling

Now let’s ensure we can work with our data.

First lets briefly look at our new dataset

skim(listings2)
Data summary
Name listings2
Number of rows 16116
Number of columns 17
_______________________
Column type frequency:
character 5
Date 2
logical 2
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
neighbourhood_cleansed 0 1.00 4 38 0 22 0
property_type 0 1.00 3 35 0 67 0
room_type 0 1.00 10 15 0 4 0
price 0 1.00 5 9 0 498 0
license 11561 0.28 6 63 0 3624 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 5 1.00 2008-09-24 2021-08-31 2015-04-07 3235
last_review 2087 0.87 2011-06-12 2021-09-07 2019-06-10 2019

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 5 1 0.12 FAL: 14103, TRU: 2008
instant_bookable 0 1 0.23 FAL: 12408, TRU: 3708

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
accommodates 0 1.00 2.84 1.31 0.00 2.00 2.00 4.00 16.00 ▇▃▁▁▁
review_scores_rating 2087 0.87 4.69 0.67 0.00 4.67 4.86 5.00 5.00 ▁▁▁▁▇
number_of_reviews 0 1.00 24.65 56.71 0.00 2.00 8.00 22.00 877.00 ▇▁▁▁▁
minimum_nights 0 1.00 3.99 20.99 1.00 2.00 2.00 3.00 1100.00 ▇▁▁▁▁
reviews_per_month 2087 0.87 0.68 1.72 0.01 0.11 0.27 0.62 107.84 ▇▁▁▁▁
bedrooms 898 0.94 1.53 0.95 1.00 1.00 1.00 2.00 50.00 ▇▁▁▁▁
beds 97 0.99 1.76 1.47 0.00 1.00 1.00 2.00 33.00 ▇▁▁▁▁
availability_30 0 1.00 4.03 8.70 0.00 0.00 0.00 0.00 30.00 ▇▁▁▁▁

We see some issues that we will try and resolve.

3.2.1 Price

First:

Price is sometimes given as a character string and we would like it to be a double variable.

Below we change the price to a double, i.e. a numeric variable.

#Change the 'price' to numeric
listings2 <- listings2 %>% 
  mutate(price = parse_number(price))

#Use `typeof(listing2$price)` to confirm that `price` is now stored as a number.
typeof(listings2$price)
[1] "double"

3.2.2 Licenses

We then want to impute the missing value of “license” with logical judgement instead of a license number.

This will make our analysis simply focus on whether people have a license or not.

#Impute the missing value of “license”
listings2<-listings2 %>% 
  
  #Add yes or no depending on whether variable is missing
  mutate(license = ifelse(is.na(license), "no","yes"))


#Let's check our data by filtering for our license
License_check <- listings2 %>% 
  select(license)


#Let's take a glimpse at the results.
glimpse(License_check)
Rows: 16,116
Columns: 1
$ license <chr> "yes", "yes", "no", "yes", "yes", "yes", "yes", "no", "no", "y…

The code works as hoped.

3.2.3 Property types

Next, we look at the variable property_type. We use the count function to determine how many categories there are and their frequency.

This will be useful later to summarize our property types in larger categories as there will be more observations and less variables in the regression.

#Determine how many categories there are and their frequency
count_prop <- listings2 %>% 
  
  #Select variable of interest - i.e. property types
  select(property_type) %>% 
  
  #Count Number of each property type
  count(property_type) %>% 
  
  #Arrange by most frequent category
  arrange(desc(n)) %>% 
  
  #Create a new column for proportions
  mutate(percentage=n/sum(n)*100) #Determine the proportion of the total listings the top 4 types make up 

#Print Table
count_prop
property_typenpercentage
Entire rental unit977660.7    
Private room in rental unit185011.5    
Entire residential home10456.48   
Entire townhouse4292.66   
Entire condominium (condo)3542.2    
Private room in bed and breakfast3161.96   
Private room in residential home2961.84   
Entire loft2691.67   
Houseboat1861.15   
Boat1811.12   
Room in boutique hotel1741.08   
Private room in townhouse1560.968  
Entire serviced apartment1270.788  
Private room in houseboat1120.695  
Private room in guest suite1100.683  
Room in hotel1070.664  
Private room in condominium (condo)970.602  
Private room in boat940.583  
Private room in loft530.329  
Room in bed and breakfast360.223  
Entire guest suite340.211  
Entire villa300.186  
Shared room in rental unit240.149  
Private room in hostel230.143  
Room in aparthotel210.13   
Entire guesthouse200.124  
Room in hostel200.124  
Room in serviced apartment160.0993 
Private room150.0931 
Private room in guesthouse140.0869 
Entire place110.0683 
Private room in serviced apartment100.0621 
Entire cottage90.0558 
Shared room in hostel80.0496 
Tiny house80.0496 
Private room in farm stay70.0434 
Private room in tiny house60.0372 
Private room in villa60.0372 
Entire cabin50.031  
Barn40.0248 
Entire bed and breakfast40.0248 
Entire bungalow40.0248 
Private room in casa particular40.0248 
Camper/RV30.0186 
Entire home/apt30.0186 
Floor30.0186 
Private room in cabin30.0186 
Shared room in houseboat30.0186 
Shared room in residential home30.0186 
Campsite20.0124 
Entire chalet20.0124 
Private room in dome house20.0124 
Private room in earth house20.0124 
Private room in floor20.0124 
Private room in island20.0124 
Shared room in bed and breakfast20.0124 
Tower20.0124 
Yurt20.0124 
Bus10.00621
Cave10.00621
Earth house10.00621
Private room in bungalow10.00621
Private room in nature lodge10.00621
Room in casa particular10.00621
Shared room in boat10.00621
Shared room in loft10.00621
Tipi10.00621

The top 4 common property types are “Entire rental unit”,“Private room in rental unit”,“Entire residential home” and “Entire townhouse”. The top 4 types make up 60.66%, 11.48%, 6.48%, and 2.66% of the total listings

This will also make it easier for our regressions at a later stage given this is categorical variables.

#Simplify the "property_type" 
listings2 <- listings2 %>%
  
  #Create new names called other for properties not in the top four categories
  mutate(prop_type_simplified = case_when(
    
    #Specifying names to be kept
    property_type %in% c("Entire rental unit",
                         "Private room in rental unit", 
                         "Entire residential home", 
                         "Entire townhouse") ~ property_type, 
    
    #Specifying ot leave remaining in other
    TRUE ~ "Other"))

#We check the `prop_type_simplified` was correctly made
listings2 %>%
  
  #Count each category
  count(property_type, prop_type_simplified) %>%
  
  #Arrange by largest category
  arrange(desc(n))    
property_typeprop_type_simplifiedn
Entire rental unitEntire rental unit9776
Private room in rental unitPrivate room in rental unit1850
Entire residential homeEntire residential home1045
Entire townhouseEntire townhouse429
Entire condominium (condo)Other354
Private room in bed and breakfastOther316
Private room in residential homeOther296
Entire loftOther269
HouseboatOther186
BoatOther181
Room in boutique hotelOther174
Private room in townhouseOther156
Entire serviced apartmentOther127
Private room in houseboatOther112
Private room in guest suiteOther110
Room in hotelOther107
Private room in condominium (condo)Other97
Private room in boatOther94
Private room in loftOther53
Room in bed and breakfastOther36
Entire guest suiteOther34
Entire villaOther30
Shared room in rental unitOther24
Private room in hostelOther23
Room in aparthotelOther21
Entire guesthouseOther20
Room in hostelOther20
Room in serviced apartmentOther16
Private roomOther15
Private room in guesthouseOther14
Entire placeOther11
Private room in serviced apartmentOther10
Entire cottageOther9
Shared room in hostelOther8
Tiny houseOther8
Private room in farm stayOther7
Private room in tiny houseOther6
Private room in villaOther6
Entire cabinOther5
BarnOther4
Entire bed and breakfastOther4
Entire bungalowOther4
Private room in casa particularOther4
Camper/RVOther3
Entire home/aptOther3
FloorOther3
Private room in cabinOther3
Shared room in houseboatOther3
Shared room in residential homeOther3
CampsiteOther2
Entire chaletOther2
Private room in dome houseOther2
Private room in earth houseOther2
Private room in floorOther2
Private room in islandOther2
Shared room in bed and breakfastOther2
TowerOther2
YurtOther2
BusOther1
CaveOther1
Earth houseOther1
Private room in bungalowOther1
Private room in nature lodgeOther1
Room in casa particularOther1
Shared room in boatOther1
Shared room in loftOther1
TipiOther1
#Removing the old property type
listings2 <- subset(listings2, select = -property_type)

Our table comes out as expected and we can see that all smaller property types are in the other category.

#Using above dataset for the confidence interval calculations
formula_ci <- listings2%>% 
  group_by(prop_type_simplified) %>% 
  
  #Calculate weight's summary statistics for people exercising at least 3 times a week 
  
  # calculate mean, SD, count, SE, lower/upper 95% CI
  summarise(
    average_price=mean(price,na.rm=TRUE), #Mean, we choose to ignore any missing values by setting the 'na.rm = TRUE'
            
    sd_price=sd(price,na.rm=TRUE), #Standard Deviation
            
    count= n(), #Observations
           
    t_critical = qt(0.975,count-1), #T-Critical at 95% Confidence Interval and these observations
            
    se_price=sd_price/sqrt(count), #Standard Error 
           
    margin_of_error= t_critical*se_price, #Margin of Error
            
    price_low= average_price - margin_of_error, #Lower interval
            
    price_high= average_price + margin_of_error) #Upper Interval 

formula_ci
prop_type_simplifiedaverage_pricesd_pricecountt_criticalse_pricemargin_of_errorprice_lowprice_high
Entire rental unit160  16697761.961.683.29157  163  
Entire residential home208  16910451.965.2310.3 198  219  
Entire townhouse218  1344291.976.4712.7 206  231  
Other160  20830161.963.797.44152  167  
Private room in rental unit93.113618501.963.176.2286.899.3
ggplot(formula_ci, aes(x=average_price, y=prop_type_simplified, color=prop_type_simplified)) +

#geom_errorbar function allows us to show the two bars with confidence intervals

geom_errorbar(aes(xmin=price_low, xmax=price_high),width = 0.1, size=0.5)+ 
  
geom_point(aes(x=average_price),size=1)+

  
theme_bw() +
  
theme(legend.position = "none",axis.title.y=element_blank())+
  
  labs(title = "Confidence interval for average price per property type",
       subtitle = "95% confidence intervals overlap",
       x = "Average Price"
       ) +
  NULL

The property type seems to have a significant effect on prices. The length of the intervals mainly vary due to the differences in sample size. The most common property type (Entire rental unit) has a small interval because it has the larger sample. We can estimate quite aaccuratly the average price in this category.

3.2.4 Neighbourhood

Next, we look at the variable neighbourhood_cleansed. We use the count function to determine how many categories there are and their frequency. The top 5 common neighbourhoods are “De Baarsjes - Oud-West”,“De Pijp - Rivierenbuurt”,“Centrum-West”, “Centrum-Oost” and “WesterPark”. The top 5 types make up 16.8%, 12.3%, 10.8%, 8.51%%, and 7.48% of the total listings

#Determine how many categories there are and their frequency
count_prop <- listings2 %>% 
  
  #Selecting variable of interest
  select(neighbourhood_cleansed) %>% 
  
  #Counting observations 
  count(neighbourhood_cleansed) %>% 
  
  #Arrange by observations
  arrange(desc(n)) %>% 
  
  #In Percentage
  mutate(percentage=n/sum(n)*100) #Determine the proportion of the total listings the top 4 types make up 

#Print Table
count_prop
neighbourhood_cleansednpercentage
De Baarsjes - Oud-West270116.8  
De Pijp - Rivierenbuurt198612.3  
Centrum-West174610.8  
Centrum-Oost13728.51 
Westerpark12057.48 
Zuid11967.42 
Oud-Oost10326.4  
Bos en Lommer9315.78 
Oostelijk Havengebied - Indische Buurt7334.55 
Oud-Noord5183.21 
Watergraafsmeer4762.95 
IJburg - Zeeburgereiland3962.46 
Slotervaart3492.17 
Noord-West3232    
Noord-Oost2271.41 
Buitenveldert - Zuidas2171.35 
Geuzenveld - Slotermeer2001.24 
De Aker - Nieuw Sloten1160.72 
Osdorp1140.707
Gaasperdam - Driemond1070.664
Bijlmer-Centrum910.565
Bijlmer-Oost800.496

Looking into this we see that there are many more high observation neighbourhoods compared to property types.

Therefore, we believe it makes more sense to split the neighbourhood by their relative prices.

We do this below

#Setting neighbourhood by price
neighbourhood_group <- listings2 %>% 
  
  #Variable of interest
  group_by(neighbourhood_cleansed) %>% 
  
  #Getting median price
  summarize(price_median = median(price)) %>% 
  
  #Arranging by highest price
  arrange(desc(price_median))

#Print
neighbourhood_group
neighbourhood_cleansedprice_median
Centrum-Oost150
Centrum-West150
IJburg - Zeeburgereiland146
De Pijp - Rivierenbuurt140
Zuid134
De Baarsjes - Oud-West129
Oud-Oost129
Westerpark129
Oud-Noord125
Watergraafsmeer125
Noord-West120
Noord-Oost119
Oostelijk Havengebied - Indische Buurt114
Buitenveldert - Zuidas112
Bos en Lommer110
Geuzenveld - Slotermeer100
Slotervaart100
De Aker - Nieuw Sloten98
Osdorp91
Gaasperdam - Driemond85
Bijlmer-Centrum83
Bijlmer-Oost79

We see that some central neighbourhoods are more expensive, which logically makes sense given this is a large city and that is where tourists and other activities are.

Let’s categorize neighbourhood by price

#Simplify the "property_type" 
listings2 <- listings2 %>%
  
  #Adding new variable names
  mutate(neighbourhood_cleansed = case_when(
                                  neighbourhood_cleansed == "Centrum-Oost" ~ "Top 5",
                                  neighbourhood_cleansed == "Centrum-West" ~ "Top 5",
                                  neighbourhood_cleansed == "IJburg - Zeeburgereiland" ~ "Top 5",
                                  neighbourhood_cleansed == "Zuid" ~ "Top 5",
                                  neighbourhood_cleansed == "De Pijp - Rivierenbuurt" ~ "Top 5",
                                  neighbourhood_cleansed == "Oud-Noord" ~ "Top 6-10",
                                  neighbourhood_cleansed == "Watergraafsmeer" ~ "Top 6-10",
                                  neighbourhood_cleansed == "Oud-Oost" ~ "Top 6-10",
                                  neighbourhood_cleansed == "Westerpark" ~ "Top 6-10",
                                  neighbourhood_cleansed == "De Baarsjes - Oud-West" ~ "Top 6-10",
                                  neighbourhood_cleansed == "Noord-West" ~ "Top 11-15",
                                  neighbourhood_cleansed == "Noord-Oost" ~ "Top 11-15",
                                  neighbourhood_cleansed == "Oostelijk Havengebied - Indische Buurt" ~ "Top 11-15",
                                  neighbourhood_cleansed == "Buitenveldert - Zuidas" ~ "Top 11-15",
                                  neighbourhood_cleansed == "Bos en Lommer" ~ "Top 11-15",
                                  neighbourhood_cleansed == "Geuzenveld - Slotermeer" ~ "Remaining",
                                  neighbourhood_cleansed == "Slotervaart" ~ "Remaining",
                                  neighbourhood_cleansed == "De Aker - Nieuw Sloten" ~ "Remaining",
                                  neighbourhood_cleansed == "Osdorp" ~ "Remaining",
                                  neighbourhood_cleansed == "Gaasperdam - Driemond" ~ "Remaining",
                                  neighbourhood_cleansed == "Bijlmer-Centrum" ~ "Remaining",
                                  neighbourhood_cleansed == "Bijlmer-Oost" ~ "Remaining"))
                                  
                                
#Check the `prop_type_simplified` was correctly made
listings2 %>%
  
  #Counting by number of observations
  count(neighbourhood_cleansed) %>%
  
  #Displaying in descending order
  arrange(desc(n))    
neighbourhood_cleansedn
Top 56696
Top 6-105932
Top 11-152431
Remaining1057

3.2.5 Minimum nights

Let’s first look into how minimum nights look at Airbnb Amsterdam.

#Find out the most common values for the variable `minimum_nights`
listings2 %>%
  
  #Count Observations
  count(minimum_nights) %>%
  
  #Arrange by observations
  arrange(desc(n))
minimum_nightsn
2      5993
3      3918
1      2658
4      1367
5      895
7      481
6      234
14      109
10      92
30      63
8      33
21      32
20      30
25      19
12      18
15      18
28      16
60      15
9      12
90      11
13      7
59      7
180      7
11      4
19      4
23      4
29      4
50      4
58      4
100      4
16      3
27      3
31      3
200      3
300      3
365      3
18      2
26      2
42      2
45      2
99      2
150      2
1e+03      2
24      1
33      1
37      1
38      1
63      1
70      1
75      1
80      1
89      1
94      1
95      1
120      1
181      1
183      1
186      1
222      1
240      1
500      1
999      1
1e+03      1
1.1e+031

By looking into the vairable minimum_nights, we found out that airbnb most commonly required people to stay for at least 2 nights. The majority of users tend to spend less than a week, which indicates that Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels..

There are also some variables that stand out i.e. minimum nights of 1,100 days (3 years minimum rent). These individuals are probably looking for long term renters.

We only want to include listings in our regression analysis that are intended for travel purposes, i.e. no more than 4 days.

#Filter the Airbnb data so that it only includes observations with `minimum_nights <= 4`
listings2 <- listings2 %>% 
  filter(minimum_nights <= 4)

3.2.6 Dropping NA’s

We then skim the dataset to see if further data wrangling is required.

skim(listings2)
Data summary
Name listings2
Number of rows 13936
Number of columns 17
_______________________
Column type frequency:
character 4
Date 2
logical 2
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
neighbourhood_cleansed 0 1 5 9 0 4 0
room_type 0 1 10 15 0 4 0
license 0 1 2 3 0 2 0
prop_type_simplified 0 1 5 27 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 5 1.00 2008-09-24 2021-08-31 2015-05-04 3121
last_review 1602 0.89 2012-05-12 2021-09-07 2019-06-16 1920

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 5 1 0.13 FAL: 12118, TRU: 1813
instant_bookable 0 1 0.24 FAL: 10656, TRU: 3280

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
accommodates 0 1.00 2.82 1.31 0.00 2.00 2.00 4.00 16.00 ▇▃▁▁▁
price 0 1.00 155.55 164.51 0.00 96.00 130.00 180.00 8000.00 ▇▁▁▁▁
review_scores_rating 1602 0.89 4.69 0.65 0.00 4.67 4.85 5.00 5.00 ▁▁▁▁▇
number_of_reviews 0 1.00 26.39 59.57 0.00 2.00 8.00 23.00 877.00 ▇▁▁▁▁
minimum_nights 0 1.00 2.29 0.88 1.00 2.00 2.00 3.00 4.00 ▃▇▁▅▂
reviews_per_month 1602 0.89 0.72 1.78 0.01 0.11 0.28 0.65 107.84 ▇▁▁▁▁
bedrooms 814 0.94 1.50 0.85 1.00 1.00 1.00 2.00 15.00 ▇▁▁▁▁
beds 85 0.99 1.74 1.49 0.00 1.00 1.00 2.00 33.00 ▇▁▁▁▁
availability_30 0 1.00 4.07 8.71 0.00 0.00 0.00 0.00 30.00 ▇▁▁▁▁

As we can see, review_scores_rating, reviews_per_month, host_is_superhost, bedrooms and beds still have missing values. We will drop unreviewed units for now and as these units will likely be less accurate in terms of price / review ratings. Similarly for bedrooms and beds, we drop missing values.

#Dropping values

listings2 <-  listings2 %>% 
  
  #Choosing which variables to drop NAs from
  drop_na(bedrooms,beds,host_is_superhost,review_scores_rating,reviews_per_month)

#Skimming to confirm code works as intended. 
skim(listings2)
Data summary
Name listings2
Number of rows 11586
Number of columns 17
_______________________
Column type frequency:
character 4
Date 2
logical 2
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
neighbourhood_cleansed 0 1 5 9 0 4 0
room_type 0 1 10 15 0 4 0
license 0 1 2 3 0 2 0
prop_type_simplified 0 1 5 27 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 0 1 2008-09-24 2021-08-11 2015-03-04 2921
last_review 0 1 2012-05-12 2021-09-07 2019-06-14 1891

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 0 1 0.14 FAL: 9945, TRU: 1641
instant_bookable 0 1 0.22 FAL: 9018, TRU: 2568

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
accommodates 0 1 2.85 1.30 1.00 2.00 2.00 4.00 16.00 ▇▁▁▁▁
price 0 1 154.60 143.04 4.00 99.00 130.00 180.00 8000.00 ▇▁▁▁▁
review_scores_rating 0 1 4.70 0.65 0.00 4.67 4.86 5.00 5.00 ▁▁▁▁▇
number_of_reviews 0 1 29.16 61.35 1.00 4.00 10.00 26.00 866.00 ▇▁▁▁▁
minimum_nights 0 1 2.32 0.87 1.00 2.00 2.00 3.00 4.00 ▃▇▁▅▂
reviews_per_month 0 1 0.69 1.79 0.01 0.11 0.28 0.62 107.84 ▇▁▁▁▁
bedrooms 0 1 1.50 0.85 1.00 1.00 1.00 2.00 15.00 ▇▁▁▁▁
beds 0 1 1.76 1.51 0.00 1.00 1.00 2.00 33.00 ▇▁▁▁▁
availability_30 0 1 3.89 8.47 0.00 0.00 0.00 0.00 30.00 ▇▁▁▁▁

3.2.7 Dropping Rating 0

We also know that ratings on Airbnb are 1-5, therefore there must be an issue with reviews that are 0 out of 5.

Let’s remove these from our data

#Dropping reviews at 0
listings2 <- listings2 %>% 
  filter(review_scores_rating > 0)

3.2.8 Final skim of data

Let’s take a final skim of the data

skim(listings2)
Data summary
Name listings2
Number of rows 11426
Number of columns 17
_______________________
Column type frequency:
character 4
Date 2
logical 2
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
neighbourhood_cleansed 0 1 5 9 0 4 0
room_type 0 1 10 15 0 4 0
license 0 1 2 3 0 2 0
prop_type_simplified 0 1 5 27 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 0 1 2008-09-24 2021-08-11 2015-03-01 2911
last_review 0 1 2012-05-12 2021-09-07 2019-06-16 1877

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 0 1 0.14 FAL: 9785, TRU: 1641
instant_bookable 0 1 0.22 FAL: 8889, TRU: 2537

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
accommodates 0 1 2.85 1.30 1.00 2.00 2.00 4.00 16.00 ▇▁▁▁▁
price 0 1 153.68 122.98 4.00 99.00 130.00 180.00 7999.00 ▇▁▁▁▁
review_scores_rating 0 1 4.77 0.33 1.00 4.67 4.86 5.00 5.00 ▁▁▁▁▇
number_of_reviews 0 1 29.55 61.69 1.00 4.00 11.00 26.00 866.00 ▇▁▁▁▁
minimum_nights 0 1 2.32 0.87 1.00 2.00 2.00 3.00 4.00 ▃▇▁▅▂
reviews_per_month 0 1 0.70 1.80 0.01 0.12 0.28 0.63 107.84 ▇▁▁▁▁
bedrooms 0 1 1.50 0.85 1.00 1.00 1.00 2.00 15.00 ▇▁▁▁▁
beds 0 1 1.76 1.51 0.00 1.00 1.00 2.00 33.00 ▇▁▁▁▁
availability_30 0 1 3.89 8.47 0.00 0.00 0.00 0.00 30.00 ▇▁▁▁▁

We now have wrangled our data sufficiently and have: - 4 character variables ready for analysis with not too many categories. - 2 Date variables with no missing values - 2 Logical variables with no missing variables - 9 range of numeric variables with no missing values.

We also have 11.4 thousand observations.

We will now explore how these variables affect price.

3.3 Visualisations

By visualising the relationship between price and the variables that we believe will affect the price substantially, we hope to identify a noticeable correlation between price and each of those variables and use statistical modelling to further investigate their relationships and provide possible explanation.

3.3.1 Overview of Price

We first take a look at the price.

#Overview of price
favstats(listings2$price)
minQ1medianQ3maxmeansdnmissing
4991301808e+03154123114260

We see that price ranges from 4 to 8,000 per night with a mean of 154 and median of 130 (i.e. some left skew)

We create the density plot of price.

#Create a density plot of prices
ggplot(listings2,aes(x=price))+
  
  #Using a density plot
  geom_density()+
  
  #Adding useful titles
  labs(title="Density plot of listing Airbnb prices",
       x="Price",
       y="Density") + 
  
  #Minimal theme
  theme_minimal()

#Create a density plot of log of prices
ggplot(listings2,aes(x=log(price)))+
  
  #Density plot
  geom_density()+
  
  #Useful Labes
  labs(title="Density plot of listing Airbnb (log) prices",
       x="Log Price",
       y="Density") + 
  
  #Minimal Theme
  theme_minimal()

From here forward we will use Log which is more normally distributed. We can see that there are some peaks and lows on the graph which we connect to well-known disposition of people to round numbers, i.e. setting price to be USD 140 instead of USD 137 or USD 142. We assume that the graph would be smooth and almost perfectly normally distributed without this feature

3.3.2 Location/ Neighbourhood

First, we skimmed the neighbourhood_cleansed data to analyse the distribution of price of listings depends on neighbourhood. From the table below, we found out that median price will better suits the purpose of our research than the mean price because the distribution of price of each neighbourhood is greaatly skewed with extreme outliers. For example all neighbourhoods have lower median than mean price while the most expensive property costs USD8,000, causing a large standard deviation. Hence, standard deviation is unusually large among neighbourhood.

#Analyse the distribution of price of listings depends on neighbourhood
favstats(price~neighbourhood_cleansed,data=listings2) %>% 
  
  #Arrange by count
  arrange(desc(n)) 
neighbourhood_cleansedminQ1medianQ3maxmeansdnmissing
Top 541041492002e+03172113  46470
Top 6-1010991301758e+03151149  43490
Top 11-15258911815080012866.517050
Remaining22659513699911384.27250

We then plot a density graph to see the price distribution of the most popular neighbourhood.

#Create the dataset of the top neighbourhood and the price
listings_neigh<-listings2 %>% 
  
  #Group by variable we are looking at
  group_by(neighbourhood_cleansed) %>% 
  
  #ANd get the price
  summarize(price=price)

listings_neigh$new <- factor(listings_neigh$neighbourhood_cleansed, levels = c("Top 5", "Top 6-10", "Top 11-15", "Remaining"))

#Plot the density graph
ggplot(listings_neigh,aes(x=log(price), y = desc(new), fill = new))+
  
  #Density Ridges Plit
  geom_density_ridges(alpha=0.3) +    
  
  #Useful lables
  labs(      
      title = "Difference in price by region expensiveness",
      subtitle="Density plot for prices in different region groups",
      x = "Price per night (log)",
      y = "Density") + 
      theme_classic() + 
  
  #removing y-axis 
    theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank()) +
  
  #Title of Legend
  guides(fill=guide_legend(title="Neighbourhood regions"))+ 
  
  #Plotting a vertical line at the median
  geom_vline(xintercept = median(log(listings2$price)), colour = "#001e62")+
  NULL

The graph shows that the neighbourhoods vary in price and will likely be a good predictor for our analysis later on.

Let’s look at the median price by neighbourhood as well:

#Calculate the median price of different neighborhood
median_per_neighborhood <- listings2 %>% 
  
  #Grouping by variable
  group_by(neighbourhood_cleansed) %>% 
  
  #We want median price
  summarise(median_price = median(price))

#Plot the graph
ggplot(median_per_neighborhood,
       
       #Price versus neighbourhood
       aes(x = reorder(neighbourhood_cleansed, median_price),
           y= median_price)) +
  
  #Columns chart
  geom_col(fill = "skyblue") +
  
  #USeful labes
  labs(
    title = "Median price per night per Neighborhood",
    x="Neighborhood",
    y="Median Price"
    ) +
  
  #To flip axis
  coord_flip() +
  
  #Simple theme
  theme_bw() +
  NULL

Plotting on a median price basis indicates that these prices are still indeed impacted by how close they are to center.

3.3.3 Number of Bedrooms

We skimmed our data to summarise the price distribution based on number of bedrooms.

From the table below, we noticed that listings with more than five bedrooms have less than 10 samples, therefore, we decided to only include properties with up to five bedrooms. Furthermore, we found out that median price will better suit the purpose of our research than the mean price because the distribution of price based on number of bedrooms is greatly skewed due to extreme outliers. Therefore, we believe that the median price gives better understanding of the relationship between the price and the number of bedrooms

#Analyse the distribution of price of listings depends on number of bedrooms
favstats(price~bedrooms, data=listings2)

bedroomsminQ1medianQ3maxmeansdnmissing
14861121498e+03       124113  74200
2421301722252e+03       18994.928280
3291501992701.92e+03232143  8530
4251942503501.16e+03293165  2600
5125247330478814       383191  400
650409443493550       389197  50
7487487487487487       487  10
850162530561811       422308  60
9557557557557557       557  10
10443696725797857       704159  50
1283838383857       212316  60
15120120120120120       120  10
The graph below shows that the listings having more bedrooms have a higher median price.

We would like to plot this information

#Filter the bedrooms under 5 to remove outliers
listings2 <- listings2 %>% 
  filter(bedrooms<=5)

#Median calculations
median_per_bedroom <- listings2 %>% 
  
  #By Bedroom
  group_by(bedrooms) %>% 
  
  #Summarized by median price
  summarise(median_price = median(price))

#Plotting the graph
ggplot(median_per_bedroom,
       
       #Bedroom by price
       aes(x = reorder(bedrooms,median_price),
           
           #Price on Y Axis
           y= median_price)) +
  
  #in Column format
  geom_col(fill = "orchid3") +
  
  #With useful labelss
  labs(
    title = "Median price per night per bedrooms",
    x="Number of Bedrooms",
    y="Median Price"
    ) +
  
  #To flip axis
  coord_flip() +
  
  #Simple theme
  theme_bw() +
  NULL

We see a strong relationship with number of bedrooms and price.

We assume number of bedrooms will have high correlation with accommodation and number of beds and will therefore leave those two variables out for now.

3.3.4 Ratings

We skimmed our data to summarise the price distribution based on the rating score.

#Analyse the distribution of price of listings depends on the rating score
favstats(price~review_scores_rating,data=listings2)
review_scores_ratingminQ1medianQ3maxmeansdnmissing
14761.299.5163  400       145  120       140
26079.892.5116  225       112  54.6     120
2.33121121  121  121  121       121         10
2.56585  105  122  140       103  37.5     30
2.75130130  130  130  130       130         10
2.8149149  149  149  149       149         10
340100  130  166  579       146  89.4     630
3.256567.570  75  80       71.77.64    30
3.29120120  120  120  120       120         10
3.3350120  130  150  200       130  54.3     50
3.4133133  133  133  133       133         10
3.530100  117  145  345       138  80.8     230
3.5691112  133  154  175       133  59.4     20
3.578484  84  84  84       84         10
3.67085  100  138  176       115  54.6     30
3.63125134  144  154  163       144  26.9     20
3.672983.8115  165  286       133  69.6     160
3.69129129  129  129  129       129         10
3.71124148  172  196  220       172  67.9     20
3.7595135  150  219  373       180  87.9     90
3.789090  90  90  90       90         10
3.896126  152  187  262       163  59.2     60
3.8383103  114  122  130       110  20.1     40
3.866573.882.591.2100       82.524.7     20
3.8860102  144  322  500       235  234       30
3.8980122  165  208  250       165  120       20
42590  120  150  999       137  88.2     2910
4.036066.272.578.885       72.517.7     20
4.04138138  138  138  138       138         10
4.054859  70  120  171       96.365.6     30
4.0777100  145  150  155       125  34.8     50
4.0896100  101  125  760       236  293       50
4.095882  110  132  140       104  37.8     40
4.149174  298  349  400       249  181       30
4.116087  100  160  230       127  56.6     90
4.137880  121  199  461       164  111       130
4.146585  130  140  760       164  182       130
4.155483.595.5120  180       105  43.5     60
4.16140140  140  140  140       140         10
4.1785100  140  160  599       162  119       170
4.1837108  132  167  200       132  52.9     80
4.1975112  150  225  300       175  115       30
4.24979.2104  137  400       119  70.5     300
4.2150100  150  165  180       127  68.1     30
4.2263111  146  198  400       161  87.8     140
4.234985.5100  131  453       135  111       110
4.243463.884.5141  214       102  60.3     80
4.253581  103  144  599       131  98       580
4.266179.598  116  135       98  52.3     20
4.275285.2100  121  297       114  64.4     120
4.286069.879.589.299       79.527.6     20
4.294688.5118  146  229       123  46.6     180
4.33984.297.5135  200       109  46.4     160
4.315666  89  100  319       109  73.2     130
4.323347.255  112  130       74.538.7     80
4.334081.5118  152  1.15e+03139  120       1240
4.344555  60  68  139       73.437.6     50
4.354585  147  195  275       148  76.4     130
4.367099.8117  150  202       123  35.9     160
4.376685  99  100  125       94.319.2     70
4.3835108  130  171  252       137  50.8     280
4.395973.5130  176  8e+03       840  2.38e+03110
4.42279.8124  168  740       143  108       520
4.413385.5125  160  270       133  66.2     150
4.424289.8125  148  250       127  53.7     220
4.4334100  122  161  430       135  65.5     480
4.445894.2115  149  283       128  54.7     240
4.4538105  143  206  410       166  93.3     250
4.464686  130  186  250       141  65.8     220
4.474488.8102  158  500       148  111       240
4.4865105  130  158  329       154  81.6     180
4.495880  90  118  188       99.836.1     110
4.52990  120  160  2e+03       141  120       4300
4.514391.5122  162  264       135  77.3     60
4.526880  103  140  707       151  155       170
4.534582  105  150  240       118  49.1     410
4.543366  96  145  286       106  52.6     330
4.552880  113  163  389       131  74.1     510
4.564798.8124  152  250       131  49.1     640
4.572581.5108  150  740       141  111       940
4.586598.8130  162  895       162  134       480
4.595081.8130  193  464       146  88       340
4.64095  125  164  599       140  81.9     1360
4.615889.5110  172  695       149  115       350
4.622675  107  150  450       122  74.7     420
4.632991.5120  164  550       143  85.5     1150
4.642689.8130  186  1.07e+03166  149       680
4.653291.2133  188  367       147  82.5     420
4.663063.5109  180  399       131  90.3     230
4.671790  120  164  599       138  75.7     4350
4.684098.5125  163  800       154  110       630
4.693090  115  168  450       140  77.7     880
4.73599  130  172  350       140  65.3     950
4.714299  120  153  400       135  59.1     1650
4.723384.5117  144  300       128  65.1     560
4.733085  120  170  800       144  101       1240
4.743083  120  166  500       141  86.4     710
4.753795  129  180  829       148  87.4     3600
4.7634122  145  180  500       158  78.7     730
4.775095  125  162  729       146  92.8     990
4.782596  130  190  1.16e+03155  112       1530
4.792796  146  200  518       163  89.4     1370
4.825100  120  173  760       145  78.2     3080
4.812892  124  164  463       146  87.9     1410
4.822098.2135  193  699       156  92.9     1580
4.8329100  135  186  800       157  89.8     2900
4.842590.8129  166  534       145  86.3     1040
4.85495  124  194  550       149  86.1     1510
4.8645100  139  198  950       161  96       2680
4.872690  124  179  440       139  68       1450
4.8847100  125  194  1.19e+03158  101       2980
4.893896.5139  180  550       156  90.6     2510
4.931100  131  186  500       154  77.4     2160
4.9127100  140  190  500       157  79.7     1970
4.9230100  135  185  900       160  97.5     2420
4.9350109  150  200  2e+03       172  146       2200
4.9435106  144  200  745       171  104       1760
4.9550102  142  200  504       164  82.6     1400
4.9628100  136  196  399       157  80.2     1220
4.974399  139  199  350       154  67.8     930
4.984595  136  198  950       183  160       440
4.9980110  139  156  279       146  61.6     90
510101  140  190  1.92e+03163  104       34040

By drawing a regression line between the median price and the ratings, we concluded that they share a positive correlation, which means higher ratings have higher median price, and vice versa.

We will use two different methods for this, first we will use the median for each rating creating a smaller subset of datapoints:

##Create the data of price and score rating
median_per_rating <- listings2 %>% 
  
  #Group by review
  group_by(review_scores_rating) %>% 
  
  #Summarize by median price
  summarise(median_price = median(price)) %>% 
  
  #Exclude NA's 
  na.omit()


#Plotting
ggplot(median_per_rating,
       
       #Review scores vs median price
       aes(x = review_scores_rating,
           y= median_price)) +
  
  #Coloring red dots
  geom_point(color="red") +
  
  #Adding useful titles
  labs(title = "Median price per night vs. Ratings",
    x="Ratings",
    y="Median Price"
    ) +
  
  #Adding a trend line
  geom_smooth(method=lm,colour="black",alpha=0)+
  
  #Simple theme
  theme_bw() +
  
  ylim(0, 200) + 
  NULL

We see some relationship between price and ratings. Let’s explore this further and see if this depends on the room type.

#Plotting by rating but log of price
ggplot(listings2, aes(x = review_scores_rating, y = log(price), color = room_type)) + 
  
  #In red color 
  geom_point(alpha = 0.5) + 
  
  #And adding a trendline
  geom_smooth(se = F, method = "lm", color = "black") + 
  
  #In minimal Theme
  theme_bw() + 
  
    #Adding useful titles
  labs(title = "Log price per night vs. Ratings",
    x="Ratings",
    y="Log Price"
    ) +
  
  facet_wrap(~ room_type, scales = "free") + 
  
  guides(color=F) + 
  
  NULL

Interestingly, especially hotel rooms are sensitive to review ratings.

3.3.5 Superhosts

Let’s first look at a key summary

#Analyse the distribution of price of listings depends on the rating score
favstats(price~host_is_superhost,data=listings2)
host_is_superhostminQ1medianQ3maxmeansdnmissing
FALSE4991301808e+03154126  97670
TRUE278812718090015092.616340

Interestingly, non-superhosts seem to have a higher price than superhosts. Let’s plot some relationships to explore this further.

#Create the data of price, score rating, and superhost
median_per_rating1 <- listings2 %>% 
  
  #grouping by review score
  group_by(review_scores_rating) %>% 
  
  #Summarizing by price and superhost
  summarise(median_price = median(price),
            host_is_superhost=host_is_superhost) %>% 
  
  #Excluding NA's
  na.omit()

#Plotting the dataset 
ggplot(median_per_rating1,
       
       #By same as above but including host is superhost
       aes(x = review_scores_rating,
           y= median_price, 
           colour = host_is_superhost)) +
  
  #Adding scatterplot
  geom_point() +
  
  #Adding useful labels
  labs(title = "Median price per night vs. Ratings",
    x="Ratings",
    y="Median Price"
    ) +
  
  #Adding Trendline
  geom_smooth(method=lm,colour="black",alpha=0)+
  
  #Adding a simple theme
  theme_bw() +
  
  #Adding a legend
  guides(color=guide_legend(title="Host is Superhost"))+
  
  #Wrapping by host type to get two different charts
  facet_wrap(~host_is_superhost) +
  
  NULL

Although both mean and median price for listings by non-superhost were higher than that by superhost, the regression analysis above shows that the correlation between ratings and median prices is higher for superhost. Therefore, the median price of listings by superhost is more affected by the change in ratings than that by non-superhost.

Now let’s also take a look at the effect from superhosts themselves:

#Plotting by host type
ggplot(listings2,
       
       #Adding key variables
       aes(x=host_is_superhost,
           y=log(price), 
           fill = host_is_superhost))+
  
  #Looking at a boxplot for distribution
  geom_boxplot()+
  
  #Adding useful labels
  labs(title = "Superhost versus non superhosts price distributions",
    x="Superhost",
    y=" Log Price"
    )+
  
  #Removing legend
  guides(fill = F) + 
  
  #Simple Theme
  theme_bw() + 
  
  NULL

With more extreme outliers, the box plot of non-superhosts shows higher standard deviation compared to that of superhost.Let’s confirm this:

#Plot the density graph
ggplot(listings2,
       aes(x=log(price), 
           y = host_is_superhost, 
           fill = host_is_superhost))+
  
  #Density Ridges Plit
  geom_density_ridges(alpha=0.3) +    
  
  #Useful lables
  labs(      
      title = "Prices by host type",
      x = "Price per night (log)",
      y = "Density") + 
      theme_classic() + 
  
  #Title of Legend
  guides(fill=F)+ 
  
  #Plotting a vertical line at the median
  geom_vline(xintercept = median(log(listings2$price)), colour = "#001e62")+
  NULL

Based on the graph above, the distribution of superhost seems to have a greater variation. However, the distribution of superhost has a lower standard deviation than that of non-superhost. Therefore, we believe that the graph above is greatly influenced by the number of samples included, i.e. 1,634 superhost vs 9,767 non-superhost.

3.3.6 Number of reviews

Similarly we can graph a relationship between number of reviews and price.

As this is a numerical variable with many ranges we will not use favstats to look into this.

ggplot(listings2, 
       
       #Plotting reviews by price
       aes(x = number_of_reviews,
           y= log(price))) +
  
  #Adding color to the chart
  geom_point(color="red", alpha = 0.3) +
  
  #Adding useful labels
  labs(title = "Price per night vs. number of Reviews",
    x="Number of Reviews",
    y="Log Price"
    ) +
  
  #Adding a trendline
  geom_smooth(method=lm,colour="black",alpha=0)+
  
  #Adding BW Theme
  theme_bw() +
  
  NULL

We see that as the number of reviews increases the price falls.

Let’s check if this holds across categorical variables

ggplot(listings2, 
       
       #Plotting reviews by price
       aes(x = number_of_reviews,
           y= log(price),
           color = room_type)) +
  
  #Adding color to the chart
  geom_point(alpha = 0.5) +
  
  #Adding useful labels
  labs(title = "Price per night vs. number of Reviews",
    x="Number of Reviews",
    y="Log Price"
    ) +
  
  #Adding a trendline
  geom_smooth(method=lm,colour="black",alpha=0)+
  
  #Adding BW Theme
  theme_bw() +
  
  #Wrapping by host type to get two different charts
  facet_wrap(~room_type, scales ="free") + 
  
  guides(color = F) + 
  
  NULL

This is super interesting to see because we see that the relationship depends on room types. Therefore it is important to control for room type when looking at the effect of number of reviews!

3.3.7 License

Let’s first look at a summary of the data:

#Analyse the distribution of price of listings depends on the rating score
favstats(price~license,data=listings2)

licenseminQ1medianQ3maxmeansdnmissing
no4991291758e+03       15113080310
yes25961361951.19e+0315910133700
We see that the listings with license have higher mean and median prices.

By graphing a boxplot, we see that properties with license have higher Q1, median and Q3 value than properties without license. Hence, we conclude that properties with licence are more expensive than the others.

#Plotting the data
ggplot(listings2,
       aes(x=license,
           y=log(price)))+
  
  #In Wheat Color
  geom_boxplot(fill="wheat2")+
  
  #Useful labels
  labs(title = "Price per night vs. License",
    x="License",
    y=" Price"
    )+
  
  #BW Theme
  theme_bw()+ 
  
  NULL

We see that there is a higher spread on prices of properties without licenses and a slightly lower median price.

3.3.8 Room type

#Analyse the distribution of price of listings depends on the rating score
favstats(price~room_type,data=listings2)
room_typeminQ1medianQ3maxmeansdnmissing
Entire home/apt41101471998e+03169  128  87310
Hotel room2683120156900138  135  800
Private room1765851152e+0399.574  25570
Shared room2037658233686  71.5330

There seems to be an extremely strong relationship between room type and price. This is good for our analysis but let’s first plot it further.

First let’s visualize the median price:

#Calculate the median price of different room types
median_per_room <- listings2 %>% 
  
  #By Room
  group_by(room_type) %>% 
  
  #In median Prices
  summarise(median_price = median(price))

#Plot the graph
ggplot(median_per_room,
       
       #Reorder by price
       aes(x = reorder(room_type, median_price),
           
           #Price on Y Axis
           y= median_price)) +
  
  #Column Chart
  geom_col(fill = "skyblue") +
  
  #Useful Labels
  labs(
    title = "Median price by room_type",
    x="Room Type",
    y="Median Price"
    ) +
  
  #Flipping for aesthetics
  coord_flip() +
  
  #Simple theme for aesthetics
  theme_bw() +
  NULL

We see that entire homes and hotel rooms are especially expensive. We would like to plot the density of this to explore it further:

#Density Plot
ggplot(listings2,
       aes(x=log(price), 
           y = room_type, 
           fill = room_type))+
  
  #Density Ridges Plit
  geom_density_ridges(alpha=0.3) +    
  
  #Useful lables
  labs(      
      title = "Prices by room type",
      x = "Price per night (log)",
      y = "Density") + 
      theme_classic() + 
  
  #Title of Legend
  guides(fill=F)+ 
  
  #Plotting a vertical line at the median
  geom_vline(xintercept = median(log(listings2$price)), colour = "#001e62")+
  NULL

Interestingly although entire homes or apartments are the most expensive in terms of median, we see that some hotels are also extremely expensive.

We also see that these distributions are multimodal - i.e. prices are clustered around specific levels.

#Using above dataset for the confidence interval calculations
formula_ci <- listings2%>% 
  group_by(room_type) %>% 
  
  #Calculate weight's summary statistics for people exercising at least 3 times a week 
  
  # calculate mean, SD, count, SE, lower/upper 95% CI
  summarise(
    average_price=mean(price,na.rm=TRUE), #Mean, we choose to ignore any missing values by setting the 'na.rm = TRUE'
            
    sd_price=sd(price,na.rm=TRUE), #Standard Deviation
            
    count= n(), #Observations
           
    t_critical = qt(0.975,count-1), #T-Critical at 95% Confidence Interval and these observations
            
    se_price=sd_price/sqrt(count), #Standard Error 
           
    margin_of_error= t_critical*se_price, #Margin of Error
            
    price_low= average_price - margin_of_error, #Lower interval
            
    price_high= average_price + margin_of_error) #Upper Interval 

formula_ci
room_typeaverage_pricesd_pricecountt_criticalse_pricemargin_of_errorprice_lowprice_high
Entire home/apt169  128  87311.961.372.69167  172
Hotel room138  135  801.9915.1 30   108  168
Private room99.574  25571.961.462.8796.6102
Shared room86  71.5332.0412.4 25.4 60.6111
ggplot(formula_ci, aes(x=average_price, y=room_type, color=room_type)) +

#geom_errorbar function allows us to show the two bars with confidence intervals

geom_errorbar(aes(xmin=price_low, xmax=price_high),width = 0.1, size=0.5)+ 
  
geom_point(aes(x=average_price),size=1)+

  
theme_bw() +
  
theme(legend.position = "none",axis.title.y=element_blank())+
  
  labs(title = "Confidence interval for average price per room type",
       subtitle = "95% confidence intervals overlap",
       x = "Average Price"
       ) +
  NULL

The confidence intervals’ lengths are very different between each room type. Shared and hotel rooms have wide widths. There is a big uncertainty related to their average price. On the other side, we can estimate quite accurately the entire and private homes’ average.

3.3.9 Property Type

Let us look into the statistics first:

#Analyse the distribution of price of listings depends on the rating score
favstats(price~room_type,data=listings2)
room_typeminQ1medianQ3maxmeansdnmissing
Entire home/apt41101471998e+03169  128  87310
Hotel room2683120156900138  135  800
Private room1765851152e+0399.574  25570
Shared room2037658233686  71.5330

There seems to be a very strong relationship between property type and price. Lets plot the medians again:

#Calculate the median price of different room types
median_per_prop <- listings2 %>% 
  group_by(prop_type_simplified) %>% 
  summarise(median_price = median(price))

#Plot the graph
ggplot(median_per_prop,
       aes(x = reorder(prop_type_simplified, median_price),
           y= median_price)) +
  
  #Column type
  geom_col(fill = "skyblue") +
  
  #Useful labels
  labs(
    title = "Median price by property type",
    x="Room Type",
    y="Median Price"
    ) +
  
  #Flipping for aesthetics
  coord_flip() +
  
  #Simple theme
  theme_bw() +
  NULL

This is very useful but let’s also take a look at the distribution:

#Density Plot
ggplot(listings2,
       aes(x=log(price), 
           y = prop_type_simplified, 
           fill = prop_type_simplified))+
  
  #Density Ridges Plit
  geom_density_ridges(alpha=0.3) +    
  
  #Useful lables
  labs(      
      title = "Prices by property type",
      x = "Price per night (log)",
      y = "Density") + 
      theme_classic() + 
  
  #Title of Legend
  guides(fill=F)+ 
  
  #Plotting a vertical line at the median
  geom_vline(xintercept = median(log(listings2$price)), colour = "#001e62")+
  NULL

There is a very distinct pattern between property types and this will likely be a useful regressor for our analysis.

3.3.10 Avaialability next 30 days

Let’s also look at availability next 30 days which we hypothesize will be an indicator of popularity.

#Summarizing median price by availability
listings_availability<-listings2 %>% 
  group_by(availability_30) %>% 
  summarize(price=median(price))

#Plotting scatterplot of availability 
ggplot(listings_availability, 
       aes(x = availability_30,
           y= log(price))) +
  
  #Points in red
  geom_point(color="red") +
  
  #Useful Labels
  labs(title = "Log Price Per Night versus Availability",
    x="Availability next 30 days",
    y="Log Price"
    ) +
  
  #Adding a trendline
  geom_smooth(method=lm,colour="black",alpha=0)+
  
  #Simplified theme
  theme_bw() +
  NULL

The more availability, the higher the price, which indicates that more premium rooms are booked less often or are less often fully booked.

This goes against our initial thought that less availability would be popular rooms and therefore more expensive! Good that we looked into this.

3.3.11 Instant Bookable

Let’s also look at whether or not units are instantly bookable

#Analyse the distribution of price of listings depends on the rating score
favstats(price~instant_bookable,data=listings2)
instant_bookableminQ1medianQ3maxmeansdnmissing
FALSE41001301802e+0315494.788690
TRUE25891251758e+03150188  25320

Instant bookable seems to be slightly higher distributed. Let’s take a look at the graph:

#Boxplotting instant bookable vs log prices
ggplot(listings2,aes(x=instant_bookable,y=log(price)))+
  
  #Boxplot format
  geom_boxplot(fill="wheat2")+
  
  #Useful labels
  labs(title = "Distribution of prices depending on whether or not units are instantly bookable",
    x="Instantly Bookable",
    y="Log Price"
    )+
  
  #Simople Theme
  theme_bw()

There seems to be little to no difference between whether or not the units are bookable, however, we will still look further into this in the regression.

3.3.12 Host Since

Let’s also look at when they have been host since:

#Plotting scatterplot of availability 
ggplot(listings2, 
       aes(x = host_since,
           y= log(price))) +
  
  #Points in red
  geom_point(color="red", alpha = 0.2) +
  
  #Useful Labels
  labs(title = "Log price per night versus how long the host has been active on Airbnb",
    x="Host since (date)",
    y="Log Price"
    ) +
  
  #Adding a trendline
  geom_smooth(method=lm,colour="black",alpha=0)+
  
  #Simplified theme
  theme_bw() +
  NULL

There seems to be little to no relationship here and we will likely not need to consider it going forward.

Let’s check across categoris

#Plotting scatterplot of availability 
ggplot(listings2, 
       aes(x = host_since,
           y= log(price), 
           color = prop_type_simplified)) +
  
  #Points in red
  geom_point(alpha = 0.5) +
  
  #Useful Labels
  labs(title = "Log price per night versus how long the host has been active on Airbnb",
    x="Host since (date)",
    y="Log Price"
    ) +
  
  #Adding a trendline
  geom_smooth(method=lm,colour="black",alpha=0)+
  
  #Simplified theme
  theme_bw() +
  
  facet_wrap(~prop_type_simplified, scales = "free") + 
  
  guides(color = F) + 
  
  NULL

This seems to hold true across a variety of categories and we will not consider it going forward.

3.3.13 Last Review

Let’s also look at whether last review date has an impact

#Plotting scatterplot of availability 
ggplot(listings2, 
       aes(x = last_review,
           y= log(price))) +
  
  #Points in red
  geom_point(color="red", alpha = 0.2) +
  
  #Useful Labels
  labs(title = "Log price per night versus when the last review was",
    x="Last Review (date)",
    y="Log Price"
    ) +
  
  #Adding a trendline
  geom_smooth(method=lm,colour="black",alpha=0)+
  
  #Simplified theme
  theme_bw() +
  NULL

We see a slight relationship between price and the date of last review. We also noticed that some properties do not have a recent review probably due to covid-19 outbreak.

We will plot the density for reference

#Plotting scatterplot of availability 
ggplot(listings2, 
       aes(x = last_review)) + 
  
  geom_density(fill = "blue", alpha = 0.2) + 
  
  #Useful Labels
  labs(title = "Distributions of last reviews",
    x="Last Review (date)",
    ) +

  #Simplified theme
  theme_bw() +
  NULL

Although many properties did not receive a review during the Covid-19 outbreak, we will still include these data in our analysis.

#Plotting scatterplot of availability 
ggplot(listings2, 
       aes(x = last_review,
           y= log(price), 
           color = prop_type_simplified)) +
  
  #Points in red
  geom_point(alpha = 0.5) +
  
  #Useful Labels
  labs(title = "Log price per night versus when the last review was",
    x="Last Review (date)",
    y="Log Price"
    ) +
  
  #Adding a trendline
  geom_smooth(method=lm,colour="black",alpha=0)+
  
  facet_wrap(~prop_type_simplified, scales = "free", ncol = 3) + 
  
  #Simplified theme
  theme_bw() +
  
  guides(color =F) + 
  NULL

We see that this relationship holds across a variety of properties.

3.3.14 Paired relationship overview

We plotted a GG pairs plot to get a comprehensive overview of the relationship between the price of listings and the selected variables.

We only plot the numerical variables where we can get a correlation and scatterplot of the relationship

#Relatonship between the price and variables, starting with numerical variables
listings2 %>% 
  
  #Choosing our variables
  select(price, #Variable we are trying to explain
         accommodates, 
         review_scores_rating, 
         number_of_reviews, 
         minimum_nights, 
         reviews_per_month, 
         bedrooms, 
         beds, 
         availability_30,
         last_review) %>% 
  
  #Size and aesthetics
  ggpairs(aes(alpha=0.2))+
  theme_minimal(base_size=8)

From the above we see that for price the most correlated variables is the number of accomodates or bedrooms (although these are correlated with 0.73 and will likely not both be included). However, from here it seems that perhaps it is better to focus on accommodates rather than bedrooms.

It is worth noting that some of the less correlated variables seem to be review scores, number of reviews, however, we will still try and test these in our regression.

Unfortunately, GGPairs does not work well with categorical variables and we will use that data we extracted above as an indicator for whether or not these variables will be useful going forward. `

4 Regression Analysis

For the target variable \(Y\), we will use the cost for two people to stay at an Airbnb location for four (4) nights. We first create a new variable called price_4_nights that uses price, and accomodates to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.

4.1 Predicted Variable

We assume that this leaves us with Airbnbs accommodating 2 or more people. I.e. we filter out Airbnb’s which accomodate 1 person.

Also, we assume that no other people would go to the airbnb lowering the cost per person (i.e. 2 people could rent a 10 people Airbnb but would be the only two to pay).

Lets create the cost for two people and filter for the apartments that have 2 people min:

##Fist, we have to filter the data to represent 2 people staying for 4 nights.
listings_4 <- listings2 %>% 
  filter(accommodates>=2) %>% 
  mutate(price_4_nights =price*4)

We take logarithm of ’price_4_nights‘ and create the density plot, because it looks more like the normal distribution. See below:

# Plot density of price using price_4_nights

#Using price 4 nights without log
ggplot(listings_4,aes(x=price_4_nights))+
  
  #In a density plot
  geom_density() +   
  
  #Useful labels
  labs(      
      title = "Density plot for prices for 4 nights",
      x = "price per night (log)",
      y = "Density") + 
  
  #Simple Theme
      theme_classic() + 
  
    NULL

# Plot density of price using log(price_4_nights)
ggplot(listings_4,aes(x=log(price_4_nights)))+
  
  #In density Plot
  geom_density() +   
  
  #usfeul Labs
  labs(      
      title = "Density plot for prices for 4 nights",
      x = "price per night (log)",
      y = "Density") + 
  
  #Sumple Theme
      theme_classic() + 
  
    NULL

We see that taking the log of prices brings us much closer to a normal distribution why we use log prices going forward.

This is the same conclusion we made above here we just plotted the distribution for 4 nights and after adjusting our dataset throughout the above code.

4.2 Models

4.2.1 Model 1: property type, number of reviews and the rating scores

We first fit a regression model with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.

First, let us split the data into a training and testing set. The training will be used for the models and the testing later on:

#Set seed for regeneration purposes
set.seed(123456)


#Splitting 75% into a training set
train_test_split <- initial_split(listings_4, prop = 0.75)

#New variable name for training set
listings_train <- training(train_test_split)

#New Variable name for test set 
listings_test <- testing(train_test_split)

Lets now create the model using the initial variables:

#Create the model1
model1 <- lm(log(price_4_nights) ~ #Predicted Variable
               
               #Explanatory Variables
               prop_type_simplified +
               number_of_reviews +
               review_scores_rating,
             
             #Dataset
             data=listings_train)

#Summary of Model 1
summary(model1)

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating, data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5932 -0.2978 -0.0354  0.2593  4.0634 

Coefficients:
                                                  Estimate Std. Error t value
(Intercept)                                      5.836e+00  7.097e-02  82.232
prop_type_simplifiedEntire residential home      2.252e-01  2.056e-02  10.958
prop_type_simplifiedEntire townhouse             2.923e-01  3.130e-02   9.339
prop_type_simplifiedOther                       -6.992e-02  1.362e-02  -5.135
prop_type_simplifiedPrivate room in rental unit -5.324e-01  1.654e-02 -32.188
number_of_reviews                               -5.425e-04  8.414e-05  -6.447
review_scores_rating                             1.107e-01  1.480e-02   7.483
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedEntire residential home      < 2e-16 ***
prop_type_simplifiedEntire townhouse             < 2e-16 ***
prop_type_simplifiedOther                       2.88e-07 ***
prop_type_simplifiedPrivate room in rental unit  < 2e-16 ***
number_of_reviews                               1.20e-10 ***
review_scores_rating                            8.03e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4467 on 8356 degrees of freedom
Multiple R-squared:  0.1707,    Adjusted R-squared:  0.1701 
F-statistic: 286.6 on 6 and 8356 DF,  p-value: < 2.2e-16

We come to an adjusted R squared of 0.1707 which is a good start. We see that all variables have significant t-values indicating they are significantly different from 0.

More specifically these values changes as the category changes from entire rental unit (the variable excluded)

  • Entire residential home increases log of price 0.2
  • Entire Townhouse increases log of price 0.28
  • Other decreases log of price 0.007
  • Private room decreases log of price 0.53

These numbers are compared to if the unit is entire rental Unit which is the variable excluded (Categorical Variables exclude one variable)

For review scores rating we see that as rating goes up 1 unit log of price goes up 0.019.

However, as number of reviews goes up price goes down. For every 100 additional reviews price falls roughly 0.05 (Likely because those properties are less premium).

We recall from earlier that reviews should account for room type which is included now:

4.2.2 Model 2: Room Type

Now lets add the room type

#Create the model2
model2 <- lm(log(price_4_nights) ~ #Predicted Variable
               
               #Explanatory Variables
               prop_type_simplified +
               number_of_reviews +
               review_scores_rating + 
               room_type,
             
             #Dataset
             data=listings_train)

#Summary of Model 2
summary(model2)

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating + room_type, data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5898 -0.2870 -0.0362  0.2469  4.0502 

Coefficients:
                                                  Estimate Std. Error t value
(Intercept)                                      5.970e+00  6.894e-02  86.594
prop_type_simplifiedEntire residential home      2.269e-01  1.990e-02  11.405
prop_type_simplifiedEntire townhouse             2.953e-01  3.030e-02   9.745
prop_type_simplifiedOther                        2.389e-01  1.871e-02  12.768
prop_type_simplifiedPrivate room in rental unit  1.185e-02  2.810e-02   0.422
number_of_reviews                               -1.949e-04  8.281e-05  -2.354
review_scores_rating                             8.157e-02  1.438e-02   5.673
room_typeHotel room                             -2.579e-01  6.379e-02  -4.043
room_typePrivate room                           -5.622e-01  2.385e-02 -23.569
room_typeShared room                            -6.104e-01  1.064e-01  -5.737
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedEntire residential home      < 2e-16 ***
prop_type_simplifiedEntire townhouse             < 2e-16 ***
prop_type_simplifiedOther                        < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit   0.6732    
number_of_reviews                                 0.0186 *  
review_scores_rating                            1.45e-08 ***
room_typeHotel room                             5.33e-05 ***
room_typePrivate room                            < 2e-16 ***
room_typeShared room                            9.96e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4324 on 8353 degrees of freedom
Multiple R-squared:  0.2231,    Adjusted R-squared:  0.2223 
F-statistic: 266.6 on 9 and 8353 DF,  p-value: < 2.2e-16

We see that hotel rooms, private rooms and shared rooms all lower the price compared to entire home/apt.

  • Hotels lower log price -0.25
  • Private rooms lower log price -0.56
  • Shared rooms lower log price -0.61

Our model has higher Adjusted Rsquared but some variables are no longer significant.

Lets now test for VIF given our model has some p-value issues:

car::vif(model2)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 4.162167  4        1.195129
number_of_reviews    1.141835  1        1.068567
review_scores_rating 1.020372  1        1.010135
room_type            4.292148  3        1.274811

Let’s keep the variables in given they are below 5 but there is some overlap between property type and room type.

We may have to account for this later on.

Also we earlier thought about accounting for number of reviews across properties types (e.g. by including the product of the two) but we will look into this relationship later once we confirm whether or not to keep property type as a variable.

4.2.3 Model 3: Bedrooms, beds and accomodates

We want to further determine whether the number ofbedrooms, beds, or size of the house (accomodates) significant predictors of price_4_nights. Lets add it to our model.

We keep in mind that these might be correlated and will test VIF after.

#New model
model3_1 <- lm(
  
  #Predicted Variable
  log(price_4_nights) ~ 
                 
    
                #Explanatory Variables
                 prop_type_simplified+
                 number_of_reviews+
                 review_scores_rating +
                 room_type + 
                 beds +  
                 bedrooms + 
                 accommodates, 
               
            #dataset
               data = listings_train)

summary(model3_1)

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating + room_type + beds + bedrooms + accommodates, 
    data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4339 -0.2347 -0.0146  0.2135  4.0939 

Coefficients:
                                                  Estimate Std. Error t value
(Intercept)                                      5.308e+00  6.297e-02  84.295
prop_type_simplifiedEntire residential home     -7.360e-02  1.903e-02  -3.868
prop_type_simplifiedEntire townhouse             5.799e-03  2.790e-02   0.208
prop_type_simplifiedOther                        1.478e-01  1.669e-02   8.855
prop_type_simplifiedPrivate room in rental unit -3.664e-02  2.490e-02  -1.472
number_of_reviews                               -2.342e-04  7.353e-05  -3.186
review_scores_rating                             1.106e-01  1.278e-02   8.652
room_typeHotel room                             -7.229e-02  5.660e-02  -1.277
room_typePrivate room                           -3.881e-01  2.144e-02 -18.106
room_typeShared room                            -4.355e-01  9.433e-02  -4.616
beds                                             9.345e-03  5.935e-03   1.574
bedrooms                                         1.234e-01  9.302e-03  13.272
accommodates                                     1.165e-01  6.016e-03  19.368
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedEntire residential home      0.00011 ***
prop_type_simplifiedEntire townhouse             0.83537    
prop_type_simplifiedOther                        < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit  0.14117    
number_of_reviews                                0.00145 ** 
review_scores_rating                             < 2e-16 ***
room_typeHotel room                              0.20157    
room_typePrivate room                            < 2e-16 ***
room_typeShared room                            3.97e-06 ***
beds                                             0.11541    
bedrooms                                         < 2e-16 ***
accommodates                                     < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3828 on 8350 degrees of freedom
Multiple R-squared:  0.3914,    Adjusted R-squared:  0.3906 
F-statistic: 447.6 on 12 and 8350 DF,  p-value: < 2.2e-16

We notice that beds are not significant and will take that out.

#New model
model3_2 <- lm(
  
  #Predicted Variable
  log(price_4_nights) ~ 
                 
    
                #Explanatory Variables
                 prop_type_simplified+
                 number_of_reviews+
                 review_scores_rating +
                 room_type + 
                 bedrooms + 
                 accommodates, 
               
            #dataset
               data = listings_train)

summary(model3_2)

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating + room_type + bedrooms + accommodates, 
    data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4334 -0.2340 -0.0144  0.2132  4.0992 

Coefficients:
                                                  Estimate Std. Error t value
(Intercept)                                      5.302e+00  6.285e-02  84.349
prop_type_simplifiedEntire residential home     -7.151e-02  1.898e-02  -3.767
prop_type_simplifiedEntire townhouse             8.377e-03  2.786e-02   0.301
prop_type_simplifiedOther                        1.489e-01  1.667e-02   8.929
prop_type_simplifiedPrivate room in rental unit -3.620e-02  2.490e-02  -1.454
number_of_reviews                               -2.283e-04  7.344e-05  -3.108
review_scores_rating                             1.105e-01  1.278e-02   8.648
room_typeHotel room                             -7.243e-02  5.661e-02  -1.279
room_typePrivate room                           -3.874e-01  2.143e-02 -18.074
room_typeShared room                            -4.307e-01  9.429e-02  -4.568
bedrooms                                         1.291e-01  8.582e-03  15.042
accommodates                                     1.212e-01  5.245e-03  23.100
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedEntire residential home     0.000166 ***
prop_type_simplifiedEntire townhouse            0.763655    
prop_type_simplifiedOther                        < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit 0.145961    
number_of_reviews                               0.001889 ** 
review_scores_rating                             < 2e-16 ***
room_typeHotel room                             0.200763    
room_typePrivate room                            < 2e-16 ***
room_typeShared room                               5e-06 ***
bedrooms                                         < 2e-16 ***
accommodates                                     < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3828 on 8351 degrees of freedom
Multiple R-squared:  0.3912,    Adjusted R-squared:  0.3904 
F-statistic: 487.9 on 11 and 8351 DF,  p-value: < 2.2e-16

We also notice that some variables are no longer significant.

Let’s check VIF

vif(model3_2)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 5.152846  4        1.227456
number_of_reviews    1.145956  1        1.070493
review_scores_rating 1.028391  1        1.014096
room_type            4.429142  3        1.281504
bedrooms             2.473767  1        1.572821
accommodates         2.182146  1        1.477209

We now see quite a high variance inflation factor for Prop_type_simplified, and Room Type. This makes sense as the room type (i.e. hotel room) is related to property type (i.e. a hotel)

Let’s try and run and see each model without one another.

#Same model but without Property Type
model3_3 <- lm(
  
  log(price_4_nights) ~ 
    
    #Explanatory Variables
    number_of_reviews+
    review_scores_rating + 
    room_type+
    bedrooms + 
    accommodates, 
  
  data = listings_train)

#Same model but without Room Type
model3_4 <- lm(
  
  log(price_4_nights) ~ 
    
    #Explanatory Variables
    prop_type_simplified+
    number_of_reviews+
    review_scores_rating + 
    bedrooms + 
    accommodates, 
  
  data = listings_train)

vif(model3_3)
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.139493  1        1.067471
review_scores_rating 1.026732  1        1.013278
room_type            1.229365  3        1.035015
bedrooms             2.207683  1        1.485827
accommodates         2.163928  1        1.471029
vif(model3_4)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.430238  4        1.045746
number_of_reviews    1.109832  1        1.053486
review_scores_rating 1.019848  1        1.009875
bedrooms             2.458383  1        1.567923
accommodates         2.173088  1        1.474140

We immidieatly lower our variance inflation factor.

Now let’s try and decide which model is stronger going forward.

summary(model3_3)

Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating + 
    room_type + bedrooms + accommodates, data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4421 -0.2413 -0.0190  0.2125  4.0866 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.2844695  0.0634856  83.239  < 2e-16 ***
number_of_reviews     -0.0001524  0.0000741  -2.057  0.03968 *  
review_scores_rating   0.1155040  0.0129204   8.940  < 2e-16 ***
room_typeHotel room    0.0628398  0.0552016   1.138  0.25500    
room_typePrivate room -0.3499489  0.0116290 -30.093  < 2e-16 ***
room_typeShared room  -0.2964919  0.0941722  -3.148  0.00165 ** 
bedrooms               0.1187920  0.0082029  14.482  < 2e-16 ***
accommodates           0.1266656  0.0052843  23.970  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3873 on 8355 degrees of freedom
Multiple R-squared:  0.3765,    Adjusted R-squared:  0.376 
F-statistic: 720.9 on 7 and 8355 DF,  p-value: < 2.2e-16
summary(model3_4)

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating + bedrooms + accommodates, data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4258 -0.2358 -0.0135  0.2199  4.1128 

Coefficients:
                                                  Estimate Std. Error t value
(Intercept)                                      5.174e+00  6.369e-02  81.235
prop_type_simplifiedEntire residential home     -9.289e-02  1.932e-02  -4.808
prop_type_simplifiedEntire townhouse            -1.365e-02  2.838e-02  -0.481
prop_type_simplifiedOther                       -5.977e-02  1.191e-02  -5.017
prop_type_simplifiedPrivate room in rental unit -4.037e-01  1.468e-02 -27.502
number_of_reviews                               -4.579e-04  7.372e-05  -6.212
review_scores_rating                             1.313e-01  1.298e-02  10.115
bedrooms                                         1.409e-01  8.726e-03  16.141
accommodates                                     1.267e-01  5.338e-03  23.737
                                                Pr(>|t|)    
(Intercept)                                      < 2e-16 ***
prop_type_simplifiedEntire residential home     1.55e-06 ***
prop_type_simplifiedEntire townhouse               0.631    
prop_type_simplifiedOther                       5.36e-07 ***
prop_type_simplifiedPrivate room in rental unit  < 2e-16 ***
number_of_reviews                               5.48e-10 ***
review_scores_rating                             < 2e-16 ***
bedrooms                                         < 2e-16 ***
accommodates                                     < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3905 on 8354 degrees of freedom
Multiple R-squared:  0.3665,    Adjusted R-squared:  0.3659 
F-statistic:   604 on 8 and 8354 DF,  p-value: < 2.2e-16

We see that model 3_2 (incl. room type) provides better explanatory power than 3_3 (incl. property type), and will utilize that going forward.

This model has an adjusted r square of 0.3659 and adds variables beds, bedrooms and accommodates.

  • Each Bedroom adds 0.118 to the log price
  • Each Accommodates increase adds 0.126 to the log price.

We earlier commented that the relationship of number of ratings depends on room type so we analyze a model looking into the product of the two:

#Same model but without Property Type
model3_3_product <- lm(
  
  log(price_4_nights) ~ 
    
    #Explanatory Variables
    number_of_reviews+
    review_scores_rating + 
    room_type+
    bedrooms + 
    accommodates + 
    room_type*number_of_reviews, 
  
  data = listings_train)

#Same model but without Room Type
summary(model3_3_product)

Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating + 
    room_type + bedrooms + accommodates + room_type * number_of_reviews, 
    data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4407 -0.2371 -0.0164  0.2117  4.0862 

Coefficients:
                                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                              5.2747982  0.0635659  82.982   <2e-16
number_of_reviews                        0.0002605  0.0001480   1.761   0.0783
review_scores_rating                     0.1162390  0.0129210   8.996   <2e-16
room_typeHotel room                      0.1253254  0.0628741   1.993   0.0463
room_typePrivate room                   -0.3346790  0.0128508 -26.043   <2e-16
room_typeShared room                    -0.2845573  0.1142397  -2.491   0.0128
bedrooms                                 0.1210258  0.0082253  14.714   <2e-16
accommodates                             0.1250238  0.0053038  23.573   <2e-16
number_of_reviews:room_typeHotel room   -0.0011738  0.0004459  -2.632   0.0085
number_of_reviews:room_typePrivate room -0.0005238  0.0001718  -3.049   0.0023
number_of_reviews:room_typeShared room  -0.0004791  0.0011559  -0.415   0.6785
                                           
(Intercept)                             ***
number_of_reviews                       .  
review_scores_rating                    ***
room_typeHotel room                     *  
room_typePrivate room                   ***
room_typeShared room                    *  
bedrooms                                ***
accommodates                            ***
number_of_reviews:room_typeHotel room   ** 
number_of_reviews:room_typePrivate room ** 
number_of_reviews:room_typeShared room     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3871 on 8352 degrees of freedom
Multiple R-squared:  0.3775,    Adjusted R-squared:  0.3767 
F-statistic: 506.5 on 10 and 8352 DF,  p-value: < 2.2e-16

This input does not offer addiitonal explanatory power and includes 3 additional variables so we will not consider it going forward.

4.2.4 Model4: Superhosts

We want to explore whether the superhosts (host_is_superhost) command a pricing premium, after controlling for other variables.

Let us build this model on model 3.2

#Model 3.2 incl. superhost 
model4 <- lm(
  
  log(price_4_nights) ~ 
    
    #Explanatory Variables
    number_of_reviews+
    review_scores_rating + 
    room_type+
    bedrooms + 
    accommodates+ 
    host_is_superhost, 
  
  data = listings_train)

summary(model4)

Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating + 
    room_type + bedrooms + accommodates + host_is_superhost, 
    data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4313 -0.2353 -0.0155  0.2141  4.0933 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.369e+00  6.382e-02  84.134  < 2e-16 ***
number_of_reviews     -3.479e-04  7.665e-05  -4.538 5.76e-06 ***
review_scores_rating   9.681e-02  1.301e-02   7.441 1.10e-13 ***
room_typeHotel room    1.667e-02  5.514e-02   0.302  0.76237    
room_typePrivate room -3.705e-01  1.178e-02 -31.455  < 2e-16 ***
room_typeShared room  -2.850e-01  9.370e-02  -3.041  0.00236 ** 
bedrooms               1.220e-01  8.168e-03  14.931  < 2e-16 ***
accommodates           1.239e-01  5.266e-03  23.523  < 2e-16 ***
host_is_superhostTRUE  1.239e-01  1.332e-02   9.303  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3854 on 8354 degrees of freedom
Multiple R-squared:  0.3829,    Adjusted R-squared:  0.3823 
F-statistic:   648 on 8 and 8354 DF,  p-value: < 2.2e-16

The variable is significant however, it only has very little explanatory power.

A superhost adds 0.123 to the log price.

Let’s try and compute the VIF once again to check out model

vif(model4)
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.232011  1        1.109960
review_scores_rating 1.051819  1        1.025582
room_type            1.282096  3        1.042286
bedrooms             2.211537  1        1.487124
accommodates         2.170987  1        1.473427
host_is_superhost    1.210964  1        1.100438

We see that VIF is below 5 for all variables meaning the variables do not have too high correlation

4.2.5 Model 5, Immidiate bookings

  1. Some hosts allow you to immediately book their listing. Let’s see if affter controlling for other variables, is instant_bookable a significant predictor of price_4_nights
#Model 4.0 incl. immidiate bookings 
model5 <- lm(
  
  log(price_4_nights) ~ 
    
    #Explanatory Variables
    number_of_reviews+
    review_scores_rating + 
    room_type+
    bedrooms + 
    accommodates+ 
    host_is_superhost+ 
    instant_bookable, 
  
  data = listings_train)

summary(model5)

Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating + 
    room_type + bedrooms + accommodates + host_is_superhost + 
    instant_bookable, data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4236 -0.2342 -0.0153  0.2152  4.0506 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            5.334e+00  6.409e-02  83.227  < 2e-16 ***
number_of_reviews     -3.623e-04  7.659e-05  -4.730 2.28e-06 ***
review_scores_rating   1.028e-01  1.304e-02   7.881 3.65e-15 ***
room_typeHotel room   -1.783e-02  5.547e-02  -0.322  0.74783    
room_typePrivate room -3.811e-01  1.194e-02 -31.919  < 2e-16 ***
room_typeShared room  -2.811e-01  9.356e-02  -3.005  0.00267 ** 
bedrooms               1.245e-01  8.171e-03  15.235  < 2e-16 ***
accommodates           1.220e-01  5.270e-03  23.154  < 2e-16 ***
host_is_superhostTRUE  1.180e-01  1.335e-02   8.840  < 2e-16 ***
instant_bookableTRUE   5.513e-02  1.071e-02   5.149 2.68e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3848 on 8353 degrees of freedom
Multiple R-squared:  0.3849,    Adjusted R-squared:  0.3842 
F-statistic: 580.7 on 9 and 8353 DF,  p-value: < 2.2e-16

Here we see instant bookable is a significant predictor which will can be used in our model.

If a hotel is instantly bookable log price goes up by 0.055

We once again control VIF

vif(model5)
                         GVIF Df GVIF^(1/(2*Df))
number_of_reviews    1.233658  1        1.110702
review_scores_rating 1.060229  1        1.029674
room_type            1.336159  3        1.049485
bedrooms             2.219462  1        1.489786
accommodates         2.181078  1        1.476847
host_is_superhost    1.219980  1        1.104527
instant_bookable     1.101576  1        1.049560

Still these factors are not correlated enough for us to be worried.

4.2.6 Model 6: Neighboughoods

For this section we have not chosen the neighbourhood_overview variables as it had a lot of NAs. We use the categories we set up earlier which split neighbourhoods into the most costly and least costly area.

We therefore expect it to have high explanatory power.

#Model 5 + neighbourhoods
model6 <- lm(
  
  #Variable to predict
  log(price_4_nights) ~ 
    
    #Explanatory Variables
    number_of_reviews +
    review_scores_rating + 
    room_type +
    bedrooms + 
    accommodates + 
    host_is_superhost + 
    instant_bookable + 
    neighbourhood_cleansed, 
  
  #Dataset
  data = listings_train)

summary(model6)

Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating + 
    room_type + bedrooms + accommodates + host_is_superhost + 
    instant_bookable + neighbourhood_cleansed, data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5398 -0.2242 -0.0157  0.2040  4.0793 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      5.048e+00  6.347e-02  79.527  < 2e-16 ***
number_of_reviews               -4.719e-04  7.316e-05  -6.450 1.18e-10 ***
review_scores_rating             1.101e-01  1.244e-02   8.845  < 2e-16 ***
room_typeHotel room             -2.245e-03  5.295e-02  -0.042  0.96618    
room_typePrivate room           -3.705e-01  1.151e-02 -32.186  < 2e-16 ***
room_typeShared room            -2.567e-01  8.927e-02  -2.875  0.00405 ** 
bedrooms                         1.329e-01  7.804e-03  17.032  < 2e-16 ***
accommodates                     1.193e-01  5.029e-03  23.728  < 2e-16 ***
host_is_superhostTRUE            1.099e-01  1.274e-02   8.632  < 2e-16 ***
instant_bookableTRUE             5.557e-02  1.022e-02   5.435 5.62e-08 ***
neighbourhood_cleansedTop 11-15  8.619e-02  1.936e-02   4.453 8.60e-06 ***
neighbourhood_cleansedTop 5      3.658e-01  1.739e-02  21.034  < 2e-16 ***
neighbourhood_cleansedTop 6-10   2.274e-01  1.757e-02  12.939  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.367 on 8350 degrees of freedom
Multiple R-squared:  0.4406,    Adjusted R-squared:  0.4398 
F-statistic: 548.1 on 12 and 8350 DF,  p-value: < 2.2e-16

We see that neighbourgood categorization is a really impactful driver on our prices as our adjusted r square went to ca. 44%.

We see that compared to the bottom 7 neighbourhoods the prices are impacted as such:

  • Top 5 areas add 0.366 to log prices
  • Top 6-10 areas add 0.227 to log prices
  • Top 11-15 areas add 0.086 to prices

Testing VIF

vif(model6)
                           GVIF Df GVIF^(1/(2*Df))
number_of_reviews      1.237386  1        1.112378
review_scores_rating   1.060844  1        1.029973
room_type              1.367241  3        1.053515
bedrooms               2.225889  1        1.491941
accommodates           2.182781  1        1.477424
host_is_superhost      1.220813  1        1.104904
instant_bookable       1.103987  1        1.050708
neighbourhood_cleansed 1.043262  3        1.007084

Still VIF looks fine.

4.2.7 Model 7: Availability

Next lets look at availability in the coming 30 days

#Model 6 + 30 day availability
model7 <- lm(
  
  #Variable to predict
  log(price_4_nights) ~ 
    
    #Explanatory Variables
    number_of_reviews +
    review_scores_rating + 
    room_type +
    bedrooms + 
    accommodates + 
    host_is_superhost + 
    instant_bookable + 
    neighbourhood_cleansed+ 
    availability_30, 
  
  #Dataset
  data = listings_train)

summary(model7)

Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating + 
    room_type + bedrooms + accommodates + host_is_superhost + 
    instant_bookable + neighbourhood_cleansed + availability_30, 
    data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5106 -0.2170 -0.0067  0.1992  4.1299 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      5.029e+00  6.070e-02  82.851  < 2e-16 ***
number_of_reviews               -6.802e-04  7.035e-05  -9.668  < 2e-16 ***
review_scores_rating             1.103e-01  1.190e-02   9.268  < 2e-16 ***
room_typeHotel room             -9.324e-02  5.073e-02  -1.838  0.06612 .  
room_typePrivate room           -4.307e-01  1.122e-02 -38.403  < 2e-16 ***
room_typeShared room            -2.978e-01  8.538e-02  -3.488  0.00049 ***
bedrooms                         1.453e-01  7.476e-03  19.437  < 2e-16 ***
accommodates                     1.061e-01  4.832e-03  21.952  < 2e-16 ***
host_is_superhostTRUE            8.877e-02  1.220e-02   7.274 3.80e-13 ***
instant_bookableTRUE             3.716e-02  9.798e-03   3.793  0.00015 ***
neighbourhood_cleansedTop 11-15  9.902e-02  1.852e-02   5.348 9.15e-08 ***
neighbourhood_cleansedTop 5      3.715e-01  1.663e-02  22.338  < 2e-16 ***
neighbourhood_cleansedTop 6-10   2.453e-01  1.682e-02  14.590  < 2e-16 ***
availability_30                  1.385e-02  4.950e-04  27.979  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3509 on 8349 degrees of freedom
Multiple R-squared:  0.4886,    Adjusted R-squared:  0.4878 
F-statistic: 613.5 on 13 and 8349 DF,  p-value: < 2.2e-16

Once again, this variable is a significant explanatory variable. As availability over the next 30 days goes up with 1 log prices increase by 0.014

Let’s test for VIF once again:

vif(model7)
                           GVIF Df GVIF^(1/(2*Df))
number_of_reviews      1.251401  1        1.118660
review_scores_rating   1.060844  1        1.029973
room_type              1.422204  3        1.060458
bedrooms               2.233719  1        1.494563
accommodates           2.203964  1        1.484575
host_is_superhost      1.225527  1        1.107035
instant_bookable       1.108985  1        1.053084
neighbourhood_cleansed 1.046609  3        1.007621
availability_30        1.141229  1        1.068283

All VIFS are low and therefore we make no changes.

4.2.8 Model 8: License

#Model 7 + License
model8 <- lm(
  
  #Variable to predict
  log(price_4_nights) ~ 
    
    #Explanatory Variables
    number_of_reviews +
    review_scores_rating + 
    room_type +
    bedrooms + 
    accommodates + 
    host_is_superhost + 
    instant_bookable + 
    neighbourhood_cleansed+ 
    availability_30 + 
    license, 
  
  #Dataset
  data = listings_train)

summary(model8)

Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating + 
    room_type + bedrooms + accommodates + host_is_superhost + 
    instant_bookable + neighbourhood_cleansed + availability_30 + 
    license, data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5100 -0.2168 -0.0067  0.1993  4.1306 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      5.031e+00  6.091e-02  82.598  < 2e-16 ***
number_of_reviews               -6.854e-04  7.128e-05  -9.615  < 2e-16 ***
review_scores_rating             1.098e-01  1.195e-02   9.190  < 2e-16 ***
room_typeHotel room             -9.457e-02  5.082e-02  -1.861 0.062800 .  
room_typePrivate room           -4.310e-01  1.124e-02 -38.361  < 2e-16 ***
room_typeShared room            -2.979e-01  8.538e-02  -3.488 0.000488 ***
bedrooms                         1.450e-01  7.500e-03  19.340  < 2e-16 ***
accommodates                     1.060e-01  4.833e-03  21.943  < 2e-16 ***
host_is_superhostTRUE            8.767e-02  1.244e-02   7.048 1.95e-12 ***
instant_bookableTRUE             3.688e-02  9.817e-03   3.757 0.000173 ***
neighbourhood_cleansedTop 11-15  9.889e-02  1.852e-02   5.340 9.56e-08 ***
neighbourhood_cleansedTop 5      3.713e-01  1.663e-02  22.322  < 2e-16 ***
neighbourhood_cleansedTop 6-10   2.452e-01  1.682e-02  14.578  < 2e-16 ***
availability_30                  1.379e-02  5.100e-04  27.047  < 2e-16 ***
licenseyes                       4.356e-03  9.586e-03   0.454 0.649569    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.351 on 8348 degrees of freedom
Multiple R-squared:  0.4886,    Adjusted R-squared:  0.4877 
F-statistic: 569.6 on 14 and 8348 DF,  p-value: < 2.2e-16

License is not a significant predictor of price when controlling for remaining variables. Let’s go to the next variable

4.2.9 Model 9: Last Review

#Model 7 + Last Review
model9 <- lm(
  
  #Variable to predict
  log(price_4_nights) ~ 
    
    #Explanatory Variables
    number_of_reviews +
    review_scores_rating + 
    room_type +
    bedrooms + 
    accommodates + 
    host_is_superhost + 
    instant_bookable + 
    neighbourhood_cleansed+ 
    availability_30 + 
    last_review, 
  
  #Dataset
  data = listings_train)

summary(model9)

Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating + 
    room_type + bedrooms + accommodates + host_is_superhost + 
    instant_bookable + neighbourhood_cleansed + availability_30 + 
    last_review, data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4853 -0.2127 -0.0085  0.1972  4.1747 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      3.772e+00  1.422e-01  26.527  < 2e-16 ***
number_of_reviews               -7.711e-04  7.058e-05 -10.926  < 2e-16 ***
review_scores_rating             9.725e-02  1.191e-02   8.167 3.61e-16 ***
room_typeHotel room             -9.117e-02  5.045e-02  -1.807 0.070777 .  
room_typePrivate room           -4.327e-01  1.116e-02 -38.785  < 2e-16 ***
room_typeShared room            -2.818e-01  8.492e-02  -3.319 0.000909 ***
bedrooms                         1.435e-01  7.436e-03  19.293  < 2e-16 ***
accommodates                     1.056e-01  4.805e-03  21.968  < 2e-16 ***
host_is_superhostTRUE            6.673e-02  1.234e-02   5.407 6.60e-08 ***
instant_bookableTRUE             2.061e-02  9.889e-03   2.084 0.037160 *  
neighbourhood_cleansedTop 11-15  9.869e-02  1.841e-02   5.360 8.56e-08 ***
neighbourhood_cleansedTop 5      3.695e-01  1.654e-02  22.341  < 2e-16 ***
neighbourhood_cleansedTop 6-10   2.450e-01  1.672e-02  14.653  < 2e-16 ***
availability_30                  1.269e-02  5.063e-04  25.056  < 2e-16 ***
last_review                      7.466e-05  7.646e-06   9.764  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.349 on 8348 degrees of freedom
Multiple R-squared:  0.4943,    Adjusted R-squared:  0.4935 
F-statistic: 582.9 on 14 and 8348 DF,  p-value: < 2.2e-16

Last review is a significant predicotr of price and for each day of more recent reviews the prices are 0.00007 higher.

4.3 Diagnostics, collinearity, summary tables

4.3.1 VIF

Let’s check the VIF for our final model

vif(model9)
                           GVIF Df GVIF^(1/(2*Df))
number_of_reviews      1.273594  1        1.128536
review_scores_rating   1.074327  1        1.036497
room_type              1.423269  3        1.060591
bedrooms               2.235151  1        1.495042
accommodates           2.204224  1        1.484663
host_is_superhost      1.267912  1        1.126016
instant_bookable       1.142555  1        1.068904
neighbourhood_cleansed 1.047138  3        1.007706
availability_30        1.207846  1        1.099021
last_review            1.278776  1        1.130830

Variance inflation factor is below 5 for all variables which shows no sign of autocorrelation.

4.3.2 Residual values and normal distribution

Let’s also plot the model using the autoplot to check for potential issues:

autoplot(model9)

For our model we see that resudual errors are evenly distributed and do not change as the variable fitted values change. This is a good sign for our model.

Futhermore we see from the theoretical quantiles that our model is approximately evenly distributed but not 100% evenly distributed.

Our final model has an r-squared of 0.494 and excludes the license categorical variable (Y/N), beds (numerical), and property type.

4.3.3 Overfitting test

We now test our model for overfitting

#To get the RMSE from the training set
rmse_train <- listings_train %>% 
  
  #Calculating RMSE
  mutate(predictions = predict(model9, .)) %>% 
  select(predictions,price_4_nights) %>% 
  mutate(squared_error = (predictions - log(price_4_nights))^2) %>% 
  summarise(rmse = sqrt(mean(squared_error))) %>% 
  
  #Printing RMSE
  pull()
rmse_train
[1] 0.348655
#Repeated for test set
rmse_test <- listings_test %>% 
  
  #Calculating RMSE
  mutate(predictions = predict(model9, .)) %>% 
  select(predictions,price_4_nights) %>% 
  mutate(squared_error = (predictions - log(price_4_nights))^2) %>% 
  summarise(rmse = sqrt(mean(squared_error))) %>% 
  
  #Printing RMSE
  pull()
rmse_test
[1] 0.3491043

There seems to be no overfitting in our model. Let’s see if we can adjust the final model by taking out some of the less key variables without loosing explanatory power.

Specifically last review and instant bookable seemed to have little effects on the explanatory power

4.4 Adjusting final model for unneccessary variables

#Model final adjusted without instant_bookable and without last review:
model_final <- lm(
  
  #Variable to predict
  log(price_4_nights) ~ 
    
    #Explanatory Variables
    number_of_reviews +
    review_scores_rating + 
    room_type +
    bedrooms + 
    accommodates + 
    host_is_superhost + 
    neighbourhood_cleansed+ 
    availability_30, 

  #Dataset
  data = listings_train)

summary(model_final)

Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating + 
    room_type + bedrooms + accommodates + host_is_superhost + 
    neighbourhood_cleansed + availability_30, data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5161 -0.2156 -0.0072  0.1992  4.1598 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      5.054e+00  6.040e-02  83.669  < 2e-16 ***
number_of_reviews               -6.726e-04  7.038e-05  -9.557  < 2e-16 ***
review_scores_rating             1.063e-01  1.186e-02   8.961  < 2e-16 ***
room_typeHotel room             -7.094e-02  5.043e-02  -1.407 0.159563    
room_typePrivate room           -4.244e-01  1.110e-02 -38.235  < 2e-16 ***
room_typeShared room            -3.011e-01  8.544e-02  -3.525 0.000427 ***
bedrooms                         1.437e-01  7.470e-03  19.236  < 2e-16 ***
accommodates                     1.072e-01  4.827e-03  22.203  < 2e-16 ***
host_is_superhostTRUE            9.250e-02  1.217e-02   7.599 3.31e-14 ***
neighbourhood_cleansedTop 11-15  9.865e-02  1.853e-02   5.323 1.05e-07 ***
neighbourhood_cleansedTop 5      3.709e-01  1.664e-02  22.287  < 2e-16 ***
neighbourhood_cleansedTop 6-10   2.436e-01  1.682e-02  14.480  < 2e-16 ***
availability_30                  1.397e-02  4.942e-04  28.275  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3512 on 8350 degrees of freedom
Multiple R-squared:  0.4877,    Adjusted R-squared:  0.4869 
F-statistic: 662.4 on 12 and 8350 DF,  p-value: < 2.2e-16

We still keep quite a high explanatory power of 0.487% so let’s see again the RMSE

#To get the RMSE from the training set
rmse_train <- listings_train %>% 
  
  #Calculating RMSE
  mutate(predictions = predict(model_final, .)) %>% 
  select(predictions,price_4_nights) %>% 
  mutate(squared_error = (predictions - log(price_4_nights))^2) %>% 
  summarise(rmse = sqrt(mean(squared_error))) %>% 
  
  #Printing RMSE
  pull()
rmse_train
[1] 0.3509421
#Repeated for test set
rmse_test <- listings_test %>% 
  
  #Calculating RMSE
  mutate(predictions = predict(model_final, .)) %>% 
  select(predictions,price_4_nights) %>% 
  mutate(squared_error = (predictions - log(price_4_nights))^2) %>% 
  summarise(rmse = sqrt(mean(squared_error))) %>% 
  
  #Printing RMSE
  pull()
rmse_test
[1] 0.3512415

Our RMSE is still low which is good

4.5 Final Model

First let us summarize the final model:

summary(model_final)

Call:
lm(formula = log(price_4_nights) ~ number_of_reviews + review_scores_rating + 
    room_type + bedrooms + accommodates + host_is_superhost + 
    neighbourhood_cleansed + availability_30, data = listings_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5161 -0.2156 -0.0072  0.1992  4.1598 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      5.054e+00  6.040e-02  83.669  < 2e-16 ***
number_of_reviews               -6.726e-04  7.038e-05  -9.557  < 2e-16 ***
review_scores_rating             1.063e-01  1.186e-02   8.961  < 2e-16 ***
room_typeHotel room             -7.094e-02  5.043e-02  -1.407 0.159563    
room_typePrivate room           -4.244e-01  1.110e-02 -38.235  < 2e-16 ***
room_typeShared room            -3.011e-01  8.544e-02  -3.525 0.000427 ***
bedrooms                         1.437e-01  7.470e-03  19.236  < 2e-16 ***
accommodates                     1.072e-01  4.827e-03  22.203  < 2e-16 ***
host_is_superhostTRUE            9.250e-02  1.217e-02   7.599 3.31e-14 ***
neighbourhood_cleansedTop 11-15  9.865e-02  1.853e-02   5.323 1.05e-07 ***
neighbourhood_cleansedTop 5      3.709e-01  1.664e-02  22.287  < 2e-16 ***
neighbourhood_cleansedTop 6-10   2.436e-01  1.682e-02  14.480  < 2e-16 ***
availability_30                  1.397e-02  4.942e-04  28.275  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3512 on 8350 degrees of freedom
Multiple R-squared:  0.4877,    Adjusted R-squared:  0.4869 
F-statistic: 662.4 on 12 and 8350 DF,  p-value: < 2.2e-16

The model has an adjusted r square and explanatory power of 48.69%.

Let’s check for issues with residuals.

autoplot(model_final)

Again we see there are no issues with residuals or with normal distribution.

Let’s check VIF as well

vif(model_final)
                           GVIF Df GVIF^(1/(2*Df))
number_of_reviews      1.250398  1        1.118212
review_scores_rating   1.052602  1        1.025964
room_type              1.376315  3        1.054677
bedrooms               2.226404  1        1.492114
accommodates           2.195945  1        1.481872
host_is_superhost      1.217543  1        1.103423
neighbourhood_cleansed 1.044653  3        1.007307
availability_30        1.136085  1        1.065873

Also, again we still see no issues.

Lets plot our models

huxreg(list(
  "Prop Type, Reviews, Rating" = model1, 
  "+ Room Type" = model2, 
  "+ Beds + Bedrooms + Accom." = model3_1, 
  "- Beds" = model3_2, 
  "- Prop Type" = model3_3, 
  " - Room Type + Prop Type"  = model3_4, 
  " + No Reviews*Room_type"  = model3_3_product, 
  "+ Superhost" = model4, 
  "+ immidiate bookings" = model5, 
  "+ Neighbourhoods" = model6, 
  "+ Availability" = model7, 
  "+ license" = model8, 
  "+ last review" = model9,
  "FINAL (- instant bookable - last_review)" = model_final))
Prop Type, Reviews, Rating+ Room Type+ Beds + Bedrooms + Accom.- Beds- Prop Type - Room Type + Prop Type + No Reviews*Room_type+ Superhost+ immidiate bookings+ Neighbourhoods+ Availability+ license+ last reviewFINAL (- instant bookable - last_review)
(Intercept)5.836 ***5.970 ***5.308 ***5.302 ***5.284 ***5.174 ***5.275 ***5.369 ***5.334 ***5.048 ***5.029 ***5.031 ***3.772 ***5.054 ***
(0.071)   (0.069)   (0.063)   (0.063)   (0.063)   (0.064)   (0.064)   (0.064)   (0.064)   (0.063)   (0.061)   (0.061)   (0.142)   (0.060)   
prop_type_simplifiedEntire residential home0.225 ***0.227 ***-0.074 ***-0.072 ***        -0.093 ***                                                                
(0.021)   (0.020)   (0.019)   (0.019)           (0.019)                                                                   
prop_type_simplifiedEntire townhouse0.292 ***0.295 ***0.006    0.008            -0.014                                                                    
(0.031)   (0.030)   (0.028)   (0.028)           (0.028)                                                                   
prop_type_simplifiedOther-0.070 ***0.239 ***0.148 ***0.149 ***        -0.060 ***                                                                
(0.014)   (0.019)   (0.017)   (0.017)           (0.012)                                                                   
prop_type_simplifiedPrivate room in rental unit-0.532 ***0.012    -0.037    -0.036            -0.404 ***                                                                
(0.017)   (0.028)   (0.025)   (0.025)           (0.015)                                                                   
number_of_reviews-0.001 ***-0.000 *  -0.000 ** -0.000 ** -0.000 *  -0.000 ***0.000    -0.000 ***-0.000 ***-0.000 ***-0.001 ***-0.001 ***-0.001 ***-0.001 ***
(0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   (0.000)   
review_scores_rating0.111 ***0.082 ***0.111 ***0.111 ***0.116 ***0.131 ***0.116 ***0.097 ***0.103 ***0.110 ***0.110 ***0.110 ***0.097 ***0.106 ***
(0.015)   (0.014)   (0.013)   (0.013)   (0.013)   (0.013)   (0.013)   (0.013)   (0.013)   (0.012)   (0.012)   (0.012)   (0.012)   (0.012)   
room_typeHotel room        -0.258 ***-0.072    -0.072    0.063            0.125 *  0.017    -0.018    -0.002    -0.093    -0.095    -0.091    -0.071    
        (0.064)   (0.057)   (0.057)   (0.055)           (0.063)   (0.055)   (0.055)   (0.053)   (0.051)   (0.051)   (0.050)   (0.050)   
room_typePrivate room        -0.562 ***-0.388 ***-0.387 ***-0.350 ***        -0.335 ***-0.371 ***-0.381 ***-0.371 ***-0.431 ***-0.431 ***-0.433 ***-0.424 ***
        (0.024)   (0.021)   (0.021)   (0.012)           (0.013)   (0.012)   (0.012)   (0.012)   (0.011)   (0.011)   (0.011)   (0.011)   
room_typeShared room        -0.610 ***-0.435 ***-0.431 ***-0.296 **         -0.285 *  -0.285 ** -0.281 ** -0.257 ** -0.298 ***-0.298 ***-0.282 ***-0.301 ***
        (0.106)   (0.094)   (0.094)   (0.094)           (0.114)   (0.094)   (0.094)   (0.089)   (0.085)   (0.085)   (0.085)   (0.085)   
beds                0.009                                                                                            
                (0.006)                                                                                           
bedrooms                0.123 ***0.129 ***0.119 ***0.141 ***0.121 ***0.122 ***0.124 ***0.133 ***0.145 ***0.145 ***0.143 ***0.144 ***
                (0.009)   (0.009)   (0.008)   (0.009)   (0.008)   (0.008)   (0.008)   (0.008)   (0.007)   (0.007)   (0.007)   (0.007)   
accommodates                0.117 ***0.121 ***0.127 ***0.127 ***0.125 ***0.124 ***0.122 ***0.119 ***0.106 ***0.106 ***0.106 ***0.107 ***
                (0.006)   (0.005)   (0.005)   (0.005)   (0.005)   (0.005)   (0.005)   (0.005)   (0.005)   (0.005)   (0.005)   (0.005)   
number_of_reviews:room_typeHotel room                                                -0.001 **                                                         
                                                (0.000)                                                           
number_of_reviews:room_typePrivate room                                                -0.001 **                                                         
                                                (0.000)                                                           
number_of_reviews:room_typeShared room                                                -0.000                                                            
                                                (0.001)                                                           
host_is_superhostTRUE                                                        0.124 ***0.118 ***0.110 ***0.089 ***0.088 ***0.067 ***0.093 ***
                                                        (0.013)   (0.013)   (0.013)   (0.012)   (0.012)   (0.012)   (0.012)   
instant_bookableTRUE                                                                0.055 ***0.056 ***0.037 ***0.037 ***0.021 *          
                                                                (0.011)   (0.010)   (0.010)   (0.010)   (0.010)           
neighbourhood_cleansedTop 11-15                                                                        0.086 ***0.099 ***0.099 ***0.099 ***0.099 ***
                                                                        (0.019)   (0.019)   (0.019)   (0.018)   (0.019)   
neighbourhood_cleansedTop 5                                                                        0.366 ***0.371 ***0.371 ***0.369 ***0.371 ***
                                                                        (0.017)   (0.017)   (0.017)   (0.017)   (0.017)   
neighbourhood_cleansedTop 6-10                                                                        0.227 ***0.245 ***0.245 ***0.245 ***0.244 ***
                                                                        (0.018)   (0.017)   (0.017)   (0.017)   (0.017)   
availability_30                                                                                0.014 ***0.014 ***0.013 ***0.014 ***
                                                                                (0.000)   (0.001)   (0.001)   (0.000)   
licenseyes                                                                                        0.004                    
                                                                                        (0.010)                   
last_review                                                                                                0.000 ***        
                                                                                                (0.000)           
N8363        8363        8363        8363        8363        8363        8363        8363        8363        8363        8363        8363        8363        8363        
R20.171    0.223    0.391    0.391    0.377    0.366    0.377    0.383    0.385    0.441    0.489    0.489    0.494    0.488    
logLik-5123.513    -4850.132    -3829.296    -3830.537    -3930.350    -3997.392    -3924.011    -3887.254    -3874.003    -3476.954    -3102.201    -3102.097    -3054.720    -3109.400    
AIC10263.026    9722.264    7686.591    7687.074    7878.700    8014.784    7872.023    7794.508    7770.007    6981.909    6234.401    6236.195    6141.439    6246.800    
*** p < 0.001; ** p < 0.01; * p < 0.05.

We see now we have significant variables and the final model has less variables but still decent explanatory power (48.8%) and similar RMSE in the test and training set.

Interestingly, our model only predicts roughly 50% of price variations. The remaining 50% is explained by variables not included here.

50% does seem high given the final model is only using 8 different variables with some of the being categorical.

In sum our model:

  • Has explanatory power of 48.8%
  • Has not overfitting issues and RMSE is similar in the testing and training set both around 0.35
  • Has no correlation issues with all VIF factors scoring below 2.5
  • Has no difference in the residuals as the fitted values increase
  • Is approximately following the normal distribution.

4.6 Model test on Airbnb’s

We will now try to use our model to predict prices for staying for 4 nights, in a private room, with at least 10 reviews, and reviews above 4.98 (We use 4.98 to lower sample size)

First let us define our target segment

#Utilizing the full dataset
targets <- listings_4 %>%
  
  #BY Private Room
  filter(room_type == "Private room", 
         
         #At least 10 reviews
         number_of_reviews >= 10, 
         
         #A score above 4.9
         review_scores_rating >= 4.98) %>%
  
  #Predicted values in log
  mutate(log_predicted_values = predict(model_final, .),
         
         #Predicted Values Nominal
         prediction = exp(log_predicted_values)) 

targets
host_sincehost_is_superhostneighbourhood_cleansedroom_typeaccommodatespricereview_scores_ratingnumber_of_reviewsminimum_nightsreviews_per_monthbedroomsbedsavailability_30last_reviewlicenseinstant_bookableprop_type_simplifiedprice_4_nightslog_predicted_valuesprediction
2011-09-09FALSETop 5Private room22455   1020.421112020-01-21yesFALSEOther980       5.9 364
2013-01-07TRUETop 5Private room21354.985120.8611162020-03-13yesFALSEOther540       6.17478
2013-08-04TRUETop 5Private room21554.997031.1 1102020-03-02yesFALSEOther620       5.93378
2013-08-27TRUETop 11-15Private room2454.984340.441102018-06-01yesTRUEOther180       5.68293
2013-10-19TRUETop 5Private room2895   1120.141102015-10-22noFALSEOther356       5.97393
2014-06-28TRUETop 6-10Private room2705   1320.421102019-05-12noFALSEPrivate room in rental unit280       5.85346
2014-08-01FALSETop 5Private room21265   1430.531172018-09-13yesTRUEOther504       5.98395
2012-09-16TRUETop 5Private room41254.9813532.0712182021-09-01yesFALSEPrivate room in rental unit500       6.36575
2016-01-07TRUETop 6-10Private room2754.9856928.361122021-08-15yesTRUEPrivate room in rental unit300       5.5 244
2013-10-30FALSETop 5Private room22405   1620.261102017-08-20yesFALSEOther960       5.88357
2013-09-08TRUETop 5Private room2805   1130.451212019-10-06yesFALSEPrivate room in rental unit320       5.99399
2014-10-02FALSETop 11-15Private room2805   1020.271102018-10-07noFALSEPrivate room in rental unit320       5.61273
2016-05-03FALSERemainingPrivate room2475   1620.641182019-07-20noFALSEOther188       5.62276
2016-07-15TRUETop 5Private room21225   1510.2811222019-06-30yesFALSEOther488       6.28534
2015-08-11TRUETop 5Private room21144.9824324.4 11132021-05-02yesTRUEOther456       6   403
2014-09-12TRUETop 5Private room21184.9813623.3411132021-08-22yesFALSEOther472       6.07433
2015-04-30TRUETop 5Private room2904.9921333.981192021-07-26yesTRUEPrivate room in rental unit360       5.96389
2016-05-21TRUETop 5Private room22705   3221.231102019-10-26yesFALSEOther1.08e+035.96388
2017-04-17TRUETop 5Private room21954.9918123.5311122021-08-16yesFALSEPrivate room in rental unit780       6.03414
2014-07-21FALSETop 5Private room21215   2120.491102019-06-19noFALSEPrivate room in rental unit484       5.88356
2014-07-29TRUETop 5Private room21195   6621.311102019-10-08yesTRUEOther476       5.94379
2017-01-10FALSERemainingPrivate room2805   1220.241102017-09-17noFALSEPrivate room in rental unit320       5.51247
2014-06-03TRUERemainingPrivate room2614.9811122.321102021-08-16yesFALSEPrivate room in rental unit244       5.53253
2014-07-16FALSETop 6-10Private room2655   1010.211102017-11-03noFALSEPrivate room in rental unit260       5.76316
2014-01-28TRUETop 6-10Private room21104.9812222.671102020-02-29yesFALSEPrivate room in rental unit440       5.77321
2013-04-16FALSETop 6-10Private room2695   1420.341102019-03-30noFALSEPrivate room in rental unit276       5.75315
2017-10-10FALSETop 5Private room41805   1210.262202018-02-24noFALSEPrivate room in rental unit720       6.24513
2017-12-05TRUETop 11-15Private room2804.997721.871172019-09-13yesTRUEPrivate room in rental unit320       5.75316
2017-12-22FALSETop 6-10Private room41504.985631.2523272019-06-10yesFALSEOther600       6.46638
2012-05-11TRUETop 5Private room42385   3330.912272021-07-26yesFALSEOther952       6.42611
2014-07-20TRUERemainingPrivate room21255   1010.2511202018-07-25yesFALSEOther500       5.88359
2014-06-23TRUETop 6-10Private room2804.9812022.9 1102020-01-11yesTRUEOther320       5.77321
2014-04-10FALSETop 5Private room2775   2420.631102018-10-28noFALSEOther308       5.87355
2014-10-08TRUETop 5Private room21405   2620.641102020-08-25yesTRUEOther560       5.96389
2013-08-22TRUETop 5Private room21104.9914213.471162019-12-26yesTRUEPrivate room in rental unit440       5.97391
2014-11-13TRUETop 5Private room21275   2830.6811282019-09-04yesFALSEOther508       6.35575
2016-06-14FALSETop 11-15Private room2855   1010.251202019-09-18noTRUEPrivate room in rental unit340       5.61273
2012-05-23FALSETop 6-10Private room22005   1330.3211272018-08-20noFALSEPrivate room in rental unit800       6.13460
2014-04-15FALSETop 5Private room2605   2030.611102019-10-21yesTRUEOther240       5.88356
2015-06-15TRUETop 6-10Private room2745   1510.522102019-09-22noFALSEOther296       5.99399
2015-01-24TRUETop 11-15Private room21395   1210.421102019-05-06noTRUEPrivate room in rental unit556       5.7 299
2015-04-02FALSETop 5Private room21645   2120.6211122019-12-04yesFALSEOther656       6.04421
2018-08-23TRUETop 5Private room41265   9012.8 1262021-05-14yesTRUEOther504       6.22503
2015-02-12TRUETop 5Private room21394.997122.161102019-12-02yesTRUEOther556       5.93377
2018-09-05TRUETop 5Private room21155   1920.621102019-06-09noTRUEOther460       5.97391
2016-03-13FALSETop 6-10Private room21305   1210.3811302019-05-09noTRUEPrivate room in rental unit520       6.17480
2016-06-12TRUERemainingPrivate room2924.985921.8111192019-11-11yesTRUEOther368       5.83342
2015-12-15TRUETop 6-10Private room41564.999833.2212102021-08-13yesTRUEOther624       6.14465
2012-06-10TRUETop 11-15Private room21204.985521.7311232019-10-18yesFALSEOther480       5.99400
2013-12-23TRUETop 11-15Private room21104.985721.9911182019-12-08yesFALSEOther440       5.92373
2016-08-04TRUETop 6-10Private room2984.984321.3211202019-11-04yesFALSEOther392       6.1 447
2016-02-18TRUETop 11-15Private room2585   1220.391102019-05-03noTRUEPrivate room in rental unit232       5.7 299
2015-05-19FALSETop 5Private room2895   1520.461212021-08-31yesTRUEPrivate room in rental unit356       5.89363
2018-12-23TRUETop 5Private room31995   2922.1912112021-09-03yesTRUEOther796       6.22504
2014-03-09TRUETop 5Private room42075   2120.8612172019-08-21yesTRUEPrivate room in rental unit828       6.42614
2019-01-26TRUETop 11-15Private room21255   2230.7612132019-11-16yesTRUEOther500       5.88357
2014-08-17FALSERemainingPrivate room2455   1030.321102020-01-02noFALSEOther180       5.51248
2015-10-13TRUETop 5Private room2755   1830.631102019-10-20yesFALSEOther300       5.97392
2015-07-02TRUETop 6-10Private room2775   2020.6711132020-02-23yesFALSEPrivate room in rental unit308       6.02413
2012-12-28TRUETop 5Private room2985   1630.561192021-07-04yesTRUEOther392       6.1 445
2013-09-08TRUETop 5Private room2885   3411.241102020-01-27noFALSEPrivate room in rental unit352       5.96387
2019-04-04TRUETop 6-10Private room41864.985822.1322222020-09-20yesFALSEOther744       6.48651
2015-10-16FALSETop 5Private room2894.984521.9 11172021-08-29yesTRUEOther356       6.09444
2019-03-23TRUETop 5Private room2504.984011.511102020-03-07noFALSEPrivate room in rental unit200       5.95385
2019-04-10TRUETop 11-15Private room2955   2921.011102020-02-17noTRUEPrivate room in rental unit380       5.69296
2019-05-28TRUETop 11-15Private room2505   1820.751102020-03-02noFALSEPrivate room in rental unit200       5.7 298
2014-08-23TRUETop 5Private room2525   1020.391102019-08-12noTRUEPrivate room in rental unit208       5.98394
2014-04-17TRUERemainingPrivate room2605   1220.4911162020-02-03noFALSEPrivate room in rental unit240       5.83339
2017-07-23TRUERemainingPrivate room21365   1120.421102019-08-09noTRUEPrivate room in rental unit544       5.6 271
2015-04-22TRUETop 5Private room21105   1820.781442021-08-01yesFALSEOther440       6.03414
2017-07-16TRUETop 11-15Private room4925   3721.4712182020-02-24yesTRUEOther368       6.15469
2017-07-16TRUETop 11-15Private room4914.984021.5512172020-03-08yesTRUEOther364       6.13461
2013-09-25TRUETop 11-15Private room2855   1120.461102019-09-19noTRUEOther340       5.7 300
2015-01-02TRUETop 5Private room21205   1420.5811172019-11-03yesFALSEOther480       6.21498
2019-07-22TRUETop 5Private room21585   1120.4711222019-09-15yesFALSEPrivate room in rental unit632       6.28535
2017-01-09TRUETop 5Private room21505   1010.4 1102019-11-24noFALSEPrivate room in rental unit600       5.98394
2011-09-15FALSETop 5Private room2805   1510.6 1102020-02-02noFALSEOther320       5.88358
2019-08-02TRUETop 6-10Private room21895   1210.6211262020-02-07yesTRUEOther756       6.21498
2015-06-29TRUETop 6-10Private room2764.984121.691102020-03-10yesFALSEOther304       5.83339
2013-02-26TRUERemainingPrivate room2705   1220.511102019-12-01yesFALSEPrivate room in rental unit280       5.6 271
2013-07-12TRUETop 6-10Private room21605   1521.1411302020-01-04yesFALSEOther640       6.26525
2015-06-29FALSETop 5Private room42205   1318.482202021-08-02yesTRUEOther880       6.24512
2019-11-25FALSETop 5Private room2995   1213.3 1202021-09-01noTRUEOther396       5.88358
2019-12-20TRUETop 6-10Private room21295   1620.831102020-03-08yesFALSEOther516       5.84345
2020-01-29TRUETop 5Private room2795   1820.9512122021-06-05yesTRUEOther316       6.14463
2020-02-02TRUETop 5Private room2945   1831.281102021-09-03yesFALSEPrivate room in rental unit376       5.97392
2020-02-04TRUETop 5Private room21045   2811.471102021-07-18yesTRUEOther416       5.96389
2020-02-28TRUETop 11-15Private room41145   3122.232202021-07-04yesFALSEPrivate room in rental unit456       6.05423
2020-03-09TRUETop 5Private room21385   2121.6111192021-08-07yesTRUEOther552       6.23510
2016-01-26FALSERemainingPrivate room2545   10210   11142020-10-22yesFALSEPrivate room in rental unit216       5.71301
2016-05-25TRUETop 5Private room21695   3212.5512202020-10-08yesFALSEOther676       6.24513
2015-05-14TRUETop 5Private room41375   1424.4222182021-08-03yesFALSEOther548       6.58722
2015-03-04TRUETop 5Private room21585   1615.5811132021-09-05yesTRUEOther632       6.15470

We now have our predictions in place. Lets get the lower and upper CI. To do that we need the Sigma (Residual Standard Eroro)

#Let us take a look at our model

modelsigma <- model_final %>% 
  
  #Glance at key summary
  glance() %>% 
  
  #Select Sigma
  select(sigma)

modelsigma
sigma
0.351

Our Residual STandard error is 0.35. We need to add 2x the residual standard error to get the prediction. Let’s add this to our predicted values.

#Utilizing the full dataset
targets_prediction <- targets %>%
  
  #Adding the Sigma to a column
  mutate(sigma = 0.3512151, 
         
         #Getting the Lower CI in log values
         lower_ci = log_predicted_values - sigma*2,
         
         #Getting the upper CI in log values
         upper_ci = log_predicted_values + sigma*2, 
         
         #Converting Lower  CI to actual values
         LowerCI = exp(lower_ci), 
         
         #Converting Upper CI to actual values
         UpperCI = exp(upper_ci)) %>% 
  
  #Selecting key variables
  select(price_4_nights, prediction, LowerCI, UpperCI)

targets_prediction
price_4_nightspredictionLowerCIUpperCI
980       364180735       
540       478237965       
620       378187762       
180       293145591       
356       393195794       
280       346171698       
504       395196797       
500       5752851.16e+03
300       244121493       
960       357177721       
320       399198805       
320       273135552       
188       276137557       
488       5342641.08e+03
456       403199813       
472       433214874       
360       389193785       
1.08e+03388192783       
780       414205837       
484       356176719       
476       379188765       
320       247123499       
244       253125511       
260       316157638       
440       321159647       
276       315156636       
720       5132541.03e+03
320       316156637       
600       6383161.29e+03
952       6113031.23e+03
500       359178725       
320       321159648       
308       355176718       
560       389193786       
440       391194790       
508       5752851.16e+03
340       273135552       
800       460228928       
240       356177720       
296       399198805       
556       299148604       
656       421209850       
504       5032491.01e+03
556       377187762       
460       391194790       
520       480238969       
368       342169690       
624       465230939       
480       400198808       
440       373185752       
392       447222903       
232       299148604       
356       363180732       
796       5042501.02e+03
828       6143041.24e+03
500       357177720       
180       248123500       
300       392194790       
308       413205833       
392       445220897       
352       387192782       
744       6513231.32e+03
356       444220895       
200       385191777       
380       296147598       
200       298148602       
208       394195795       
240       339168685       
544       271134548       
440       414205836       
368       469232947       
364       461228930       
340       300148605       
480       4982471e+03       
632       5352651.08e+03
600       394195795       
320       358177722       
756       4982471e+03       
304       339168684       
280       271134548       
640       5252601.06e+03
880       5122541.03e+03
396       358178723       
516       345171697       
316       463229935       
376       392194790       
416       389193785       
456       423209854       
552       5102521.03e+03
216       301149608       
676       5132541.04e+03
548       7223581.46e+03
632       470233949       

Although our predicted price is not always correct, we see that the 95% confidence interval covers the price in a majority of outcomes.

We also see that because of our low R squared these intervals are extremely broad.

Just for reference we want to plot these

targets_prediction_plot <- targets_prediction %>% 
  
  pivot_longer(cols = 2:4, 
               names_to = "Interval", 
                values_to = "price")


#Plotting the intervals density
ggplot(targets_prediction_plot, aes(x = price, fill = Interval)) + 
  
  #Predicted prices
       geom_density(alpha = 0.2) + 
  
  #Simple Theme
  theme_bw() + 
  
  labs(title = "Predicted prices versus confidence intervals",
       x = "Predicted Price", 
       y = "Density") +
  
  
  NULL

We see there is a great difference between our predicted price and lower and upper confidence intervals.

This shows us that the model although it explains 48% still has quite high variety in the prices.

I.e. some prices go as high as 1,500

Let’s look at a final summary for referne

#Summarizing statistics
targets_test <- targets %>% 
  
  summarise(
    average_price=mean(prediction), #Mean, we choose to ignore any missing values by setting the 'na.rm = TRUE'
            
    sd_price=sd(prediction), #Standard Deviation
            
    count= n(), #Observations
           
    t_critical = qt(0.975,count-1), #T-Critical at 95% Confidence Interval and these observations
            
    se_price=sd_price/sqrt(count), #Standard Error 
           
    margin_of_error= t_critical*se_price, #Margin of Error
            
    price_low= average_price - margin_of_error, #Lower interval
            
    price_high= average_price + margin_of_error) #Upper Interval 


targets_test
average_pricesd_pricecountt_criticalse_pricemargin_of_errorprice_lowprice_high
40497.8931.9910.120.1383424

The point prediction is the mean equivalent of a price for 2 people for 4 nights of 381.25

5 Acknowledgements

6 Authors

Karim El Asmar, Stepan Emelianenko, Nelly Gray, Ziqi Li, Ray Park, Shirley Wan, Ditlev Meulengracht